Enjoy, you’re doing it wrong!


Repeat after me: Unicode is not UTF-\d{1,2}
June 22, 2009, 5:23 am

Stop. Unicode is not UTF-16, Unicode is not UTF-32, and Unicode is not UTF-8. Stop it.

Equating them is just plain wrong. Unicode is a standard, and UTF-(8|16|32) are character encodings that support encoding Unicode data into byte strings. UTF-16 is capable of encoding the whole Unicode repertoire, but UTF-16 is not Unicode, and Unicode is not UTF-16.

And no, it’s not almost the same. It’s nowhere near close. Think about it: you take some beginners and introduce them to programming. After a while they have the pleasure of being introduced to “Ãœ”, “ö”, “?” and their friends, and it’s time for them to learn about the huge Unicode monster. You point them to a great article, wait a bit, and head for a little Q&A session.

Inevitably, you will get strange formulations such as “I see, I will encode everything to Unicode”, or “but how can I tell if this string is Unicode-encoded?”, or better yet “all this time I’ve been using UTF-8 without knowing that it was in fact Unicode magic?”. Well, if you think that these approximate formulations are fine, and that you should not correct your beginner programmer, you’re just doing it wrong.

A byte string cannot be Unicode-encoded. It cannot. You either work with encoded byte strings, or with Unicode data (a sequence of code-points). But you can’t “Unicode-encode”; it’s nonsense.
Similarly you cannot encode a string to Unicode. You cannot. You decode a byte string from a source character set to Unicode data: code-point sequences.

Repeat after me:

  • when transforming data from Unicode to a specific character set, you are encoding Unicode codepoints into bytes, according to a specific character encoding that suits your target character set.
  • when transforming from (byte) strings to Unicode, you are decoding your data: if you fail to provide the right character encoding to decode from, you will end up with borked Unicode data.
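The two directions can be sketched in a few lines of Python 3, where `str` holds code-points and `bytes` holds encoded data. The sample word and the variable names are only illustrative assumptions; the last step shows how decoding with the wrong character encoding quietly produces the kind of garbage quoted earlier.

```python
# Unicode data: a sequence of code-points (a Python 3 `str`).
text = "\u00dcber"  # "Über"
code_points = [ord(ch) for ch in text]

# Encoding: code-points -> bytes, according to a chosen character encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding: bytes -> code-points. The matching encoding round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding raises no error here; it just
# hands you borked Unicode data ("Ü" turns into two unrelated characters).
borked = utf8_bytes.decode("cp1252")

print(code_points, utf8_bytes, latin1_bytes, round_trip, borked)
```

Note that the same code-points give different byte strings depending on the encoding chosen, which is exactly why a byte string on its own cannot be called “Unicode”.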

It’s not “okay” to let beginners work with approximate knowledge of the Unicode buzzword. They will eventually get confused, and you will end up losing time re-explaining the same things over and over. Approximate formulations reflect approximate knowledge, and you should not let that be. Approximate Unicode knowledge is the blocker, the main reason why everything we do is not (yet) in Unicode.

Because of this kind of approximation, we had broken Unicode support in Python until Python 3.0: in Python 2, Unicode data and byte strings derived from a common class. Because of this kind of approximation, we have hundreds of beginners not understanding the difference between UTF-8 and Unicode, and not understanding why string.encode('utf-8') can throw an error. You see, you just said that it was okay to “Unicode-encode”, and that UTF-8 is Unicode, so basically they are trying to “encode” strings as… Unicode, and the fact that it fails just puzzles them, because Unicode was supposed to be the magical remedy to all their encoding errors.
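To see why that call could fail, here is a rough Python 3 sketch of what Python 2 did when .encode() was called on a byte string: it first decoded the bytes implicitly, using ASCII by default, and only then encoded the result. The helper name py2_style_encode is hypothetical and only imitates that behavior; it is not part of any library.

```python
def py2_style_encode(byte_string: bytes, encoding: str) -> bytes:
    """Hypothetical helper imitating Python 2's str.encode on a byte string."""
    # Step 1 (implicit in Python 2): decode the bytes, assuming ASCII.
    unicode_data = byte_string.decode("ascii")
    # Step 2: encode the resulting code-points with the requested encoding.
    return unicode_data.encode(encoding)


# Bytes that are *already* UTF-8 encoded ("ö" is two non-ASCII bytes).
raw = "\u00f6".encode("utf-8")

try:
    py2_style_encode(raw, "utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # The failure comes from the hidden decoding step, not from the encoding.
    error_name = type(exc).__name__

print(error_name)
```

Pure ASCII input passes through untouched, which is why the problem only shows up once the first accented character arrives.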

Because of these approximations, the .NET property Encoding.Unicode is the shortcut for what should be Encoding.UTF16. There are Encoding.UTF8, Encoding.UTF32, and Encoding.ASCII, and in the middle of those… Encoding.Unicode. How can developers write such wrong things? Unicode is not an Encoding, UTF16 is not Unicode. Just look at the wonderful C# construct Encoding u16LE = Encoding.Unicode; taken directly from the documentation: congratulations, you are assigning a “Unicode” value to an “Encoding” type. Crystal clear.

A good image, perhaps, to explain the fundamental type difference between Unicode and, let’s say, UTF-16, would be to think of Unicode as an interface, and UTF-16 as a concrete class implementing that interface.

Unicode itself does not define any implementation: it defines no data representation, only an international, unambiguous way to associate a character with a code-point. You could store a list of code-points to represent Unicode data, yes, but doing this forces you to store 4 bytes per character because of the large code-point range. This is rather inefficient, and this is why UTF-* appeared. The whole idea is to map Unicode data to byte strings, choosing the mapping function yourself, so that the resulting representation fits your needs. In a way, you have many different strategies implementing the same interface, depending on your focus:

  • UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:

    “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

  • UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping the Unicode code-point directly to 4 bytes. Obviously, it’s not very size-efficient.
  • UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
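The trade-offs listed above are easy to check in Python 3. This is only a small sketch: the four sample characters and the helper name encoded_sizes are arbitrary choices made for the illustration. The "-le" codec variants are used so that no byte order mark is added to the count.

```python
# One code-point per sample, from plain ASCII up to outside the BMP.
samples = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "LATIN SMALL LETTER O WITH DIAERESIS": "\u00f6",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",  # outside the BMP
}

ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(character: str) -> dict:
    """Return the number of bytes the character takes in each encoding."""
    return {enc: len(character.encode(enc)) for enc in ENCODINGS}


sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}

for name, per_encoding in sizes.items():
    print(f"{name}: {per_encoding}")
```

Same code-points, three different byte layouts: the “interface” is identical, only the “implementation” changes.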

I’m not saying that every developer should know about these details, I’m just saying that you should know about that very basic interface/implementation difference: Unicode is just an intermediate representation, not an actual encoding. I find it awful that .NET is so misleading here. Also, UTF-(8|16|32) are the most famous “implementations”, but there are several other standards, such as UTF-EBCDIC.

If you’re an experienced programmer, please do not allow those approximations. If a code “guru” writes that sort of shortcut, how will attitudes toward Unicode ever change? Unicode is not black magic, you just have to be precise, and people will understand. Please use the right words. Over-simplifying concepts is not helping. Thanks.


24 Comments so far

Love the 1-point type! I’ve set my browser to text size “Largest”, but I guess you know best.

Comment by TC

Sorry. I would increase the size/font, but WP.com will not let me customize the CSS without paying. I can only pick templates and change the header. Heh, what did I expect? It’s free.

Comment by enjoydoingitwrong

Use Blogger then ;-) The text on this page was too small for me to read too.

Comment by Chris

May I suggest getting yourself a browser that can resize text irrespective of how the stylesheet is specified? Anything other than Internet Explorer.

Comment by Jem

How about instead of whining publicly, you use a browser that isn’t crap and set the font to your personal preference. This is this guy’s website, he can set the font however he likes. You have a browser, and you can set the font however you like. Help yrself buddy.

Comment by Charlie

I’m viewing his website on *my* PC – not his, and definitely not yours – so I’ll use whatever browser I want.

Comment by TC

I think you have two simple options here: Either use another browser (everything except MSIE should be fine), or just copy-n-paste the text to any text editor or word processor you have handy. I would suggest Notepad++, where you are able to increase/decrease the font size in a huge range with a single left click.

P.S.: Newer MSIE versions also support zoom as an alternative to changing the font size. The font size only changes the fonts, the zoom function zooms the whole document (including images etc.). The latter is always possible, no matter what CSS attributes are used in the document. You should REALLY try this feature if you want to / have to use MSIE.

So stop whining.

Comment by rolfhub

Joel is a buffoon.

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Who would you rather have program your life support Spolsky or Thompson ?

Comment by maht

Nice post, and I completely agree.
However, you are *losing* (not “loosing”) time.

Comment by Erlend

[...] a related note from the programming sub reddit, two must read articles on Unicode: Repeat after me: Unicode is not UTF-\d{1,2} and the quite famous The Absolute Minimum Every Software Developer Absolutely, Positively Must Know [...]

Pingback by handling javascript encodeURIComponent in Perl « rook's Tutorials and Notes

> Repeat after me:

You’re right. But I don’t like your tone. Neither do my students. Better try in didactics next time.

Comment by dejj

I love the tone. Gets right to the point: Do it right, nothing else will be accepted.

Comment by MattH

Agreed. I *want* to understand, but I kept being thrown by the stern lecturer venting rather than educating. I guess it is a blog. Off to read a book.

Comment by chuckles

I agree dejj. It’s obnoxious almost.

Comment by Phier

I don’t like your tone either. And you don’t have the concepts very clear, you’re making a mess of it all for yourself and others. UTF is a Unicode Consortium spec for transformation of Unicode-standard strings. UTF-\d+ IS Unicode like a Ford Focus is a car. Not every car is a Ford Focus. Thus UTF-8 IS Unicode, not vice-versa.

Comment by Ze

I can still remember hearing other nerds bemoaning the blurring of meaning between Baud and BPS. And it was just as pointless a rant.

Comment by Wolter

nice article. do you mind if i translate it into Chinese and post the translation on my blog?

Comment by nil

no, it’s fine, as long as you present it as a translation of course. Thanks for asking.

Comment by enjoydoingitwrong

Great content in the article, but like the other readers I find the tone a bit condescending.

Unicode is a fairly complicated topic and is not taught very well or at all in computer science programs. Most English speaking programmers were only vaguely aware of it until it bit us in the ass one day.

I’d love to see more entry level articles on Unicode. You clearly understand the topic and should write more. Joel’s now famous article is a nice start, but I don’t think it really goes far enough.

Comment by johnfx

[...] Repeat after me: Unicode is not UTF-\d{1,2} (a lot of text about how Unicode and UTF-xx are different things). [...]

Pingback by Amazon byteflow: Unicode in Python 2.x - what is it and why?

[...] also made me remember that I’d had this post in my reading list for a long time. In essence the point being made is that complying to Unicode [...]

Pingback by The Funny Character Taskforce Rides again! « Content Negotiable

Excellent exposition of how things are. I think Joel should put a link on his article to yours.

BTW, I’ve been confused many times using unicode() on Python 2.X, but I find Python’s one of the most convenient ways to handle Unicode, comparing to most mainstream languages (barring Py3).

Comment by Vinko

[...] Repeat after me: Unicode is not UTF-\d{1,2} (a lot of text about how Unicode and UTF-xx are different things) [...]

Pingback by Unicode in Python 2.x – what is it and why? « Всё о Linux



