Filed under: Uncategorized
Stop. Unicode is not UTF-16, Unicode is not UTF-32, and Unicode is not UTF-8. Stop it.
This is just plainly wrong. Unicode is a standard, and UTF-(8|16|32) are character encodings that support encoding Unicode data into byte strings. UTF-16 is capable of encoding the whole Unicode repertoire, but UTF-16 is not Unicode, and Unicode is not UTF-16.
And no, it’s not almost the same. It’s nowhere near close. Think about it: you take some beginners and introduce them to programming. After a while they have the pleasure of being introduced to “Ü” “ö” “?” and their friends, and it’s time for them to learn about the huge Unicode monster. You point them to a great article, wait a bit, and head for a little Q&A session.
Inevitably, you will get strange formulations such as “I see, I will encode everything to Unicode”, or “but how can I tell if this string is Unicode-encoded?”, or better “all this time I’ve been using UTF-8 without knowing that it was in fact Unicode magic?”. Well, if you think that these approximate formulations are fine, and that you should not correct your beginner programmer, you’re just doing it wrong.
A byte string cannot be Unicode-encoded. It cannot. You either work with encoded byte strings, or with Unicode data (a sequence of code-points). But you can’t “Unicode-encode”, it’s nonsense.
Similarly you cannot encode a string to Unicode. You cannot. You decode a byte string from a source character set to Unicode data: code-point sequences.
Repeat after me:
- when transforming data from Unicode to a specific character set, you are encoding Unicode codepoints into bytes, according to a specific character encoding that suits your target character set.
- when transforming from (byte) strings to Unicode, you are decoding your data: if you fail to provide the right character encoding to decode from, you will end up with borked Unicode data.
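The two directions can be sketched in a few lines of Python 3, where str holds Unicode data and bytes holds encoded byte strings. This is only an illustration; the sample text and the choice of UTF-8 and Latin-1 are assumptions made for the example.

```python
# Unicode data: a str is a sequence of code points.
text = "Üö"
code_points = [ord(ch) for ch in text]  # U+00DC and U+00F6

# Encoding: Unicode code points -> bytes, according to a character encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding: bytes -> Unicode code points, using the encoding the bytes are in.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# Decoding with the wrong encoding does not fail here: it silently
# produces borked Unicode data (four code points instead of two).
borked = utf8_bytes.decode("latin-1")

# Encoding to a character set that cannot represent the data does fail.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Note that the same Unicode data gives two different byte strings, and that neither of them is “Unicode”.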
It’s not “okay” to let beginners work with approximate knowledge of the Unicode buzzword. They will eventually get confused, and you will end up losing time re-explaining the same things over and over. Approximate formulations reflect approximate knowledge, and you should not let that be. Approximate Unicode knowledge is the blocker, the main reason why everything we do is not (yet) in Unicode.
Because of this kind of approximation, Unicode support in Python was broken until Python 3.0: in Python 2, Unicode data and byte strings derived from a common class (basestring) and were silently mixed. Because of this kind of approximation, we have hundreds of beginners who do not understand the difference between UTF-8 and Unicode, or why string.encode('utf-8') can throw an error: you see, you just said that it was okay to “Unicode-encode”, and that UTF-8 is Unicode, so basically they are trying to “encode” strings as… Unicode, and the fact that it fails just puzzles them, because Unicode was supposed to be the magical remedy to all their encoding errors.
Because of these approximations, the .NET property Encoding.Unicode is the shortcut for what should be Encoding.UTF16. There are Encoding.UTF8, Encoding.UTF32, and Encoding.ASCII, and in the middle of those… Encoding.Unicode. How can developers write such wrong things? Unicode is not an encoding, and UTF-16 is not Unicode. Just look at the wonderful C# construct Encoding u16LE = Encoding.Unicode; taken directly from the documentation: congratulations, you are assigning a “Unicode” value to an “Encoding” type. Crystal clear.
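For the curious, here is a rough Python 3 sketch of why calling encode on a Python 2 byte string could blow up. Python 2 first decoded the byte string implicitly with its default ASCII codec, and only then encoded the result. The helper name py2_style_encode is made up for this illustration; it imitates that behavior and is not a real API.

```python
def py2_style_encode(byte_string, encoding):
    """Hypothetical helper imitating Python 2's str.encode() on a byte string."""
    # Step 1 (implicit in Python 2): decode the bytes with the ASCII codec.
    unicode_data = byte_string.decode("ascii")
    # Step 2: encode the resulting code points with the requested encoding.
    return unicode_data.encode(encoding)

raw = b"\xc3\x9c"  # the UTF-8 bytes for "Ü", i.e. already-encoded data

try:
    py2_style_encode(raw, "utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # "Encoding" failed with a *decode* error: the hidden step 1 choked.
    error_name = type(exc).__name__

# Pure-ASCII byte strings survive the round trip, which hides the problem.
ascii_result = py2_style_encode(b"abc", "utf-8")

# Python 3 removed the trap: bytes objects have no encode() method at all.
bytes_can_encode = hasattr(bytes, "encode")
```

The beginner asked for an encoding and got a decoding error, which is exactly the kind of puzzlement that precise vocabulary helps to untangle.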
A good image, perhaps, to explain the fundamental type difference between Unicode and, let’s say, UTF-16, would be to think of Unicode as an interface, and of UTF-16 as a concrete class implementing the Unicode interface.
On the one hand, Unicode does not define any implementation: it defines no data representation, only an international, unequivocal way to associate a character with a code-point. You could store a list of code-points to represent Unicode data, yes, but doing this forces you to store 4 bytes per character because of the large code-point range. This is rather inefficient, and this is why UTF-* appeared. The whole idea is to map Unicode data to byte strings, choosing the mapping function yourself, so that the resulting representation fits your needs. In a way, you have many different strategies implementing the same interface, depending on your focus:
- UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:
“Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings
- UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping directly the Unicode code-point to 4 bytes. Obviously, it’s not very size-efficient.
- UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
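To make the three strategies concrete, here is a small Python 3 sketch that measures how many bytes each encoding needs for a few sample characters. The sample characters are arbitrary choices, and the little-endian variants are used only so that no byte order mark is added to the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(char):
    """Return the number of bytes needed for one character in each encoding."""
    return {encoding: len(char.encode(encoding)) for encoding in ENCODINGS}

samples = {
    "A": "A",                    # U+0041, in the ASCII set
    "Ü": "Ü",                    # U+00DC, outside ASCII but in the BMP
    "€": "€",                    # U+20AC, in the BMP
    "G clef": "\U0001D11E",      # U+1D11E, outside the BMP
}

for label, char in samples.items():
    print(label, hex(ord(char)), encoded_sizes(char))
```

One code point per character in every case; only the byte representation changes with the chosen encoding.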
I’m not writing that every developer should know about these details, I’m just writing that you should know about that very basic interface/implementation difference: Unicode is just an intermediate representation, not an actual encoding. I find it awful that .NET is so misleading here. Also, UTF-(8|16|32) are the most famous “implementations”, but there are several other standards, such as UTF-EBCDIC.
If you’re an experienced programmer, please do not allow those approximations. If a code “guru” writes that sort of shortcut, how will behavior around Unicode ever change? Unicode is not black magic, you just have to be precise, and people will understand. Please use the right words. Over-simplifying concepts is not helping. Thanks.
You’re doing it wrong.
Actually, even if you generalize the problem to Mixing tabs and spaces in code, you’re still doing it wrong. When coding a program, a file in a project, anything, you should choose an indentation style, n spaces or tabs, and stick to it.
The issue here is editor-dependent. Suppose that a coder A creates a program where tabs and spaces are mixed. Because A is not doing everything wrong, he tries to have properly indented code, and indents each code line identically within the same code block. But the way a tabulation looks depends on A’s text editor: if, let’s say, his editor renders a tab as 4 spaces, then he will insert multiples of 4 spaces here and there in his code, fulfilling the same indenting role as a tabulation.
And here comes coder B, who uses an editor which renders tabs as 8 spaces… The code will look broken, you get the picture.
But of course in Python, where indentation is a language requirement, it’s even worse. The code not only looks broken, odds are that it is broken. Better, if you’re very lucky, the syntax will still be correct, never raising any IndentationError at any point in the code, but your control flow will be silently broken: functions returning too early, loops exiting too early, and so on. Stealthy borked code. Sounds nice, right?
And today, looking at the latest code commits, you notice that yet again, while you were sleeping, some contributor inserted tabs in your project that uses a 4-space indenting convention. Damn hippie.
Well, it’s okay, you got up early today, and you’re feeling smart, as usual. You’re going to fix this, to teach that dude a lesson. You quickly hack a small bash one-liner that will replace tabs with 4 spaces:
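Here is a small Python 3 sketch of that stealthy breakage. The function first_non_negative and its input are invented for the example. The source mixes spaces with one tab, and the very same text is expanded once with 8-column tab stops (how Python 2 counted a tab) and once with 4-column tab stops (how a 4-space editor shows it). Both versions are syntactically valid, yet they do not do the same thing.

```python
MIXED_SOURCE = (
    "def first_non_negative(numbers):\n"
    "    for n in numbers:\n"
    "        if n < 0:\n"
    "            continue\n"
    "\treturn n\n"  # a single tab: how deep is this line, really?
)

def load(source, tab_width):
    """Expand tabs to the given width, execute the result, return the function."""
    namespace = {}
    exec(source.expandtabs(tab_width), namespace)
    return namespace["first_non_negative"]

# With 8-column tabs, the return sits inside the for loop.
with_tabs_as_8 = load(MIXED_SOURCE, 8)
# With 4-column tabs, the return sits after the for loop.
with_tabs_as_4 = load(MIXED_SOURCE, 4)

result_8 = with_tabs_as_8([-1, 2, 3])  # stops at the first non-negative value
result_4 = with_tabs_as_4([-1, 2, 3])  # runs the whole loop, returns the last value
print(result_8, result_4)
```

No exception, no warning, just a different answer depending on what a tab is assumed to be worth.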
for file in $(grep -Rl [[:cntrl:]] . | grep "\.py$"); do sed -i 's/^\t/    /g' $file; done
You run it in your project base directory, glance at the changes (it works, I’m too good!), commit and push them, and get back to work. Perfect, right?
Well, days later, when tens of commits piled up, people come to you complaining that this and that feature stopped working. You look at the commits, you see nothing wrong. Naturally, you assume that those punks got the command line arguments wrong, and tell them to try again.
But nothing helps. There’s a pretty major regression. After a few minutes you understand that this is the well-known Stealthy code phenomenon, and take a closer look at the indentation fixes you made a few days ago. And of course you did it wrong. Assuming that a tab should be replaced by 4 spaces because your project uses a 4-space convention was a completely wrong supposition. PUNK!
And now you’re stuck doing painful merges to try to include the recent commits while correctly fixing the indentation issues. If you learned your lesson well (but chances are that you did not), you might be considering adding a python -tt à la CruiseControl dictatorial test to automatically reject any incoming patch containing tabs.
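The -tt switch belongs to Python 2. As a rough, interpreter-independent alternative, a check along the following lines could be run on incoming files. The function name lines_with_tab_indentation is made up for this sketch, and the check is deliberately naive: it only looks at leading whitespace, so it would also flag tab-indented lines inside a multi-line string.

```python
def lines_with_tab_indentation(source):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    offenders = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indentation = line[: len(line) - len(line.lstrip())]
        if "\t" in indentation:
            offenders.append(number)
    return offenders

clean = "def f():\n    return 1\n"
dirty = "def f():\n    if True:\n\treturn 1\n"

print(lines_with_tab_indentation(clean))
print(lines_with_tab_indentation(dirty))
```

A test that fails whenever the returned list is not empty would be dictatorial enough.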
ProjectXXX/yyy.c:nn:cc: error: Python.h: No such file or directory
Don’t even start. The source you’re compiling is probably just fine. You’re doing it wrong.
You probably have Python installed, but you are missing the development package and its corresponding headers. Look for python-dev or something similar, but don’t even think about complaining to ProjectXXX’s devs.
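If in doubt, you can ask the interpreter itself where it expects its headers to be. The following Python 3 sketch uses the standard sysconfig module; the function name python_header_path is made up for the example, and the build may of course target a different interpreter than the one running this snippet.

```python
import os
import sysconfig

def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")

header = python_header_path()
headers_installed = os.path.isfile(header)

if headers_installed:
    print("found", header)
else:
    print("missing", header, "(install the development package)")
```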
You’re probably doing it wrong. Sorry.
Header from dommend, cc-by-nc-sa.