Tuesday, 10 September 2013

Confused about unicode representations

I am confused about the hex representation of Unicode. I have an example file
with a single mathematical integral sign character in it, U+222B.
If I cat the file or edit it in vi, an integral sign is displayed. A hex
dump of the file shows its hex content is 88e2 0aab.
In Python I can create the integral character as a unicode object and print p,
which renders an integral sign on my terminal.
>>> p=u'\u222b'
>>> p
u'\u222b'
>>> print p
∫
What confuses me is that I can open a file with the integral sign in it and
read it, and printing it gives the integral symbol, but the hex content is
different.
>>> c=open('mycharfile','r').read()
>>> c
'\xe2\x88\xab\n'
>>> print c
∫
One is a unicode object and one is a plain string, but what is the
relationship between the two hex codes, apparently for the same character?
How would I manually convert one to the other?
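
One possible way to see the relationship, as a rough sketch: it assumes the
file is UTF-8 encoded, which the bytes e2 88 ab suggest. In that case 222b is
the code point and e2 88 ab is its UTF-8 byte encoding. The helper names
utf8_encode_3byte and utf8_decode_3byte below are made up for illustration and
only handle the three-byte range. The sketch also assumes that the hex dump's
88e2 0aab is the same four bytes shown as 16-bit little-endian words, which is
what hexdump prints by default on a little-endian machine.

```python
import struct

def utf8_encode_3byte(code_point):
    """Hypothetical helper: encode a code point in U+0800..U+FFFF as UTF-8 by hand.

    Surrogates (U+D800..U+DFFF) are not checked for in this sketch.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte UTF-8 range is handled here")
    b1 = 0xE0 | (code_point >> 12)          # 1110xxxx: top 4 bits
    b2 = 0x80 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    return [b1, b2, b3]

def utf8_decode_3byte(b1, b2, b3):
    """Hypothetical helper: reverse of the above, three UTF-8 bytes to a code point."""
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

# The built-in codec does the same conversion in both directions.
p = u'\u222b'
encoded = p.encode('utf-8')        # unicode object -> UTF-8 bytes
decoded = encoded.decode('utf-8')  # UTF-8 bytes -> unicode object

# The same conversion done by hand with bit operations.
manual = utf8_encode_3byte(0x222B)
print([hex(b) for b in manual])         # ['0xe2', '0x88', '0xab']
print(hex(utf8_decode_3byte(*manual)))  # 0x222b

# Assumption: the hex dump grouped the file's bytes (including the trailing
# newline 0a) into little-endian 16-bit words.
words = struct.unpack('<2H', b'\xe2\x88\xab\n')
print([hex(w) for w in words])          # ['0x88e2', '0xaab']
```

With the session above, c.decode('utf-8') would give the unicode object (plus
the trailing newline) and p.encode('utf-8') would give the plain string.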
