This is a quick post I threw together on the big differences in how Python 2 and Python 3 handle byte strings and Unicode strings. It's mainly a reference, but maybe it will also help someone out.
Everything in the computer is a byte
It's important to remember that, in the end, everything we deal with is a binary byte (a sequence of 0s and 1s) when it is stored, transmitted, or computed on any normal computing system on the planet.
ASCII vs Unicode
Numbers (bytes) mean nothing on their own, so back at the beginning of computing everyone agreed that, when indicated, certain numbers would represent certain characters. The decimal number 65 (binary 1000001) would represent the character A, and so on.
It was agreed that a byte (8 bits) would be reserved to store a character. A byte gives us 2^8 = 256 possible arrangements of bits, so up to 256 different characters; the original ASCII encoding actually defined only 128 of them (7 bits), and extended ASCII filled in the remaining 128.
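Python exposes this number-to-character mapping directly through the built-in `ord` and `chr` functions (Python 3 shown):

```python
# The ASCII mapping is built into Python:
print(ord("A"))       # 65 - the code point for 'A'
print(bin(ord("A")))  # 0b1000001 - the same number in binary
print(chr(65))        # A - back from number to character
```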
256 different characters worked well until everyone realized that there were thousands of other characters in other languages that should also be supported. Unicode was then invented; its encodings use up to 4 bytes per character, allowing for more than a million valid code points.
What is a code point
This is a term used in character encoding. Basically, a code point is a numerical value that represents a single character.
- The ASCII character encoding standard has 128 valid code points.
- The extended ASCII character encoding standard has 256 valid code points.
- The Unicode standard, which encodings such as UTF-8 represent, has 1,112,064 valid code points.
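A small illustration: `ord` returns a character's code point no matter how many bytes its UTF-8 encoding needs, which makes the code-point/byte distinction concrete.

```python
# For each character: the character, its code point (hex),
# and how many bytes its UTF-8 encoding takes.
for ch in ("A", "é", "€", "😀"):
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")))
# A takes 1 byte, é takes 2, € takes 3, 😀 takes 4
```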
Python 2 vs Python 3 String Handling
|Version|String literal behavior|
|---|---|
|Python 2|`"Hello World!"` is a `str` object, but it is stored as bytes. Prefix it with `u` to get a `unicode` object, which is stored as Unicode code points.|
|Python 3|`"Hello World!"` is a `str` object that stores Unicode code points by default. Prefix it with `b` (or use `.encode()`) to get a `bytes` object.|
Some brief code examples
```python
# Python 2
>>> print type("Hello World!")
<type 'str'>      # this is a byte string
>>> print type(u"Hello World!")
<type 'unicode'>  # this is a Unicode string
```

```python
# Python 3
>>> print(type("Hello World!"))
<class 'str'>    # this is a Unicode string
>>> print(type(b"Hello World!"))
<class 'bytes'>  # this is a byte string
```
So, if everything in a computer is a byte, what exactly is a 'unicode' string in Python, and why does it differ so much from a byte string!?
I found this very confusing at first. I knew everything in a computer boiled down to bytes, and started wondering: what exactly does it mean when something gets converted from a byte string to 'unicode' in Python? In reality it must still be stored as bytes somewhere... so what exactly is going on?
I did a lot of searching on this and couldn't find any good info until I hit upon a good Stack Overflow post on the topic: How is unicode represented internally in Python?
This post helped me realize that what is actually going on is that the sequence of bytes (the byte string) is being converted from "whatever we tell Python it is" (more on this next) into Python's own internal representation of Unicode strings, which it uses for all Unicode strings. In other words, Python decodes the bytes into its preferred internal format.
Please see the post for a much better explanation!
There is no reliable way to determine the encoding of a byte string
So, if you're crawling an API or scraping a website, you have to be told the encoding of the strings you are getting. Hopefully this will be in the documentation, or, if you are dealing with properly marked-up files, you will see the encoding specified, usually at the top of the file.
Just know that there is no way to look at a byte string and reliably determine its encoding.
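Here's why: the exact same bytes can decode "successfully" under more than one encoding, producing different text. Only the sender knows which one was meant.

```python
data = b"\xc3\xa9"  # the sender meant this as UTF-8 for "é"

print(data.decode("utf-8"))    # é   - what the sender intended
print(data.decode("latin-1"))  # Ã©  - also decodes without error, but wrong
```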
Using the Unicode Sandwich model to handle Unicode
As Ned Batchelder explains in his Unicode talk, the basic idea of the Unicode Sandwich is that you get byte strings into your program and decode them into Unicode strings as soon as possible. All of the manipulation and processing of those strings in your program is done while they are in Unicode form. Then, on output, they are encoded back into byte strings and sent on their way.
Many libraries and frameworks in Python 3 make following this model very easy as they often give you input from files or the web already decoded to Unicode strings and then allow you to pass output into functions as Unicode strings where they handle the encoding back to bytes.
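The sandwich can be sketched in a few lines. The `process` name, the `.upper()` step, and the UTF-8 defaults here are just placeholders; the point is the decode-early/encode-late shape.

```python
def process(raw_bytes: bytes, in_enc: str = "utf-8",
            out_enc: str = "utf-8") -> bytes:
    text = raw_bytes.decode(in_enc)  # bottom slice: bytes -> str, ASAP
    text = text.upper()              # the middle: work on str only
    return text.encode(out_enc)      # top slice: str -> bytes on output

print(process(b"hello world"))  # b'HELLO WORLD'
```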
Where you may run into problems is with lower-level libraries that leverage sockets. Some of those socket methods are designed to take only byte strings. In Python 2 they would accept a `str` like `"Hello World!"` as input, because it actually was a byte string. In Python 3 you will need to encode the Unicode string into a byte string, e.g. `"Hello World!".encode('UTF-8')`, when you pass it into the socket function/method, or you will get an error telling you that it only takes byte strings.
Pragmatic Unicode - This is a link to Ned Batchelder's excellent talk/article about how Unicode is handled in Python. It's a very informative watch.
Unicode strings in Python: A basic tutorial - Philip Guo has a great tutorial about Unicode strings in Python.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel Spolsky's legendary article explaining what Unicode is and why it's important.
As usual, feel free to comment below or contact me if you come across any errors in this post!