Enjoy, you’re doing it wrong!

Joel? Doing it wrong?
December 14, 2009, 6:31 am
Filed under: Uncategorized

There has been quite a lot of noise around Joel Spolsky these days. Joel is the author behind the much respected Joel On Software technical blog, and the co-founder of the now famous Stack Overflow.

The target of the recent criticism (Splosky has balls of steel and a brain of feathers, Joel Inc., Stackoverflow Careers and Jumping Sharks) is mainly his new website, Stack Overflow Careers: “Stack Overflow Careers might be a bridge too far”, writes William Shields.

Well. As a reader of Joel, and as a fairly successful user of Stack Overflow, I was not really paying attention to the rants. For sure, I disagree with the concept of Careers, and I dislike the way Joel uses his blog to communicate about the site. But I was thinking “heck, everyone has to make money!”.

But his last blog entry went too far.

The claim

Joel shows an interesting histogram, the ratio of Stack Overflow users that have submitted a CV to Careers, in relation to their Stack Overflow reputation:

Histogram retrieved from Joel's blog

And Joel comments:

The higher someone’s Stack Overflow reputation, the more likely they are to have submitted a CV to Stack Overflow Careers. […] it somewhat confirms the claim we’re making to employers, which is that when you search for CVs on Stack Overflow, you are looking at some pretty gosh darn good programmers.

Really? Joel, how honest are you with your readers here?

Debugging that statement

First, there is the somewhat questionable assertion that users with high reputation on Stack Overflow are “darn good programmers”. While this might be true for users with 15K+ reputation, I am not sure the rule still holds for the others. And of course, I am not sure I would like to hire someone who spends a significant part of his work time answering the same questions over and over instead of doing what I’m paying him for. But let’s be fair here: why not. For the sake of discussion, we’ll assume that the higher one’s Stack Overflow reputation is, the more skilled he is, and the more interesting he is to Joel’s corporate clients.

Then: “The higher someone’s Stack Overflow reputation, the more likely they are to have submitted a CV to Stack Overflow Careers.” This part is correct, according to Joel’s stats, and I won’t question it.

But the conclusion, “when you search for CVs on Stack Overflow, you are looking at some pretty gosh darn good programmers” as he puts it, is somewhat dishonest.

The relevant statistic

We want to know how good the programmers on Careers are. Or rather, how high the Stack Overflow reputation of Careers users really is. Well, let’s find out!

It is easy to categorize Stack Overflow users by reputation, using the /users page, a simple binary search, a pen and a paper. You can approximate Joel’s stats from his graph. Take your favorite language, a plotting library, and here is the code!
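A minimal sketch of that computation in Python, with loud assumptions: the bucket labels, user counts and submission rates below are hypothetical placeholders, not the values read from the /users page or from Joel's graph, and cv_share_by_bucket is a name made up for illustration. The only idea it shows is that the share of CVs coming from a reputation bucket is proportional to (users in the bucket) x (submission rate in the bucket).

```python
# All numbers below are hypothetical placeholders, NOT the real figures.
# users_per_bucket: how many Stack Overflow users fall in each reputation range
# cv_rate: fraction of users in that range who submitted a CV (Joel's histogram)
users_per_bucket = {"1k-2k": 10_000, "2k-5k": 5_000, "5k-10k": 1_500, "10k+": 500}
cv_rate = {"1k-2k": 0.01, "2k-5k": 0.02, "5k-10k": 0.03, "10k+": 0.05}


def cv_share_by_bucket(users, rate):
    """Return, for each bucket, the share of all submitted CVs that come from it."""
    cvs = {bucket: users[bucket] * rate[bucket] for bucket in users}
    total = sum(cvs.values())
    return {bucket: count / total for bucket, count in cvs.items()}


if __name__ == "__main__":
    for bucket, share in cv_share_by_bucket(users_per_bucket, cv_rate).items():
        print(f"{bucket:>7}: {share:.0%} of CVs")
```

Even when the submission rate climbs with reputation, the low-reputation buckets can dominate the pool of CVs simply because they contain many more users.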

After some Inkscape prettyfication:

Repartition of Stack Overflow reputation of careers users

Of course this histogram only takes into account Careers users that are Stack Overflow users.

But it looks like when I pick a random CV, I have a 40% chance of stumbling on a user with 1k to 2k reputation. No offense to you if you have this much rep’ on Stack Overflow, but it is quite meaningless when it comes to recruiting you.

So, pretty gosh darn good programmers? Certainly


Repeat after me: Unicode is not UTF-\d{1,2}
June 22, 2009, 5:23 am
Filed under: Uncategorized

Stop. Unicode is not UTF-16, Unicode is not UTF-32, and Unicode is not UTF-8. Stop it.

This is just plainly wrong. Unicode is a standard, and UTF-(8|16|32) are character encodings that support encoding Unicode data into byte strings. UTF-16 is capable of encoding the whole Unicode repertoire, but UTF-16 is not Unicode, and Unicode is not UTF-16.

And no, it’s not almost the same. It’s nowhere near close. Think about it: you take some beginners, and introduce them to programming. After a while they get the pleasure of being introduced to “Ãœ” “ö” “?” and their friends, and it’s time for them to learn about the huge Unicode monster. You point them to a great article, wait a bit, and head for a little Q&A session.

Inevitably, you will get strange formulations such as “I see, I will encode everything to Unicode”, or “but how can I tell if this string is Unicode-encoded?”, or better “all this time I’ve been using UTF-8 without knowing that it was in fact Unicode magic?”. Well, if you think that these approximate formulations are fine, and that you should not correct your beginner programmer, you’re just doing it wrong.

A byte string cannot be Unicode-encoded. It cannot. You either work with encoded byte strings, or with Unicode data (a sequence of code-points). But you can’t “Unicode-encode”, it’s nonsense.
Similarly, you cannot encode a string to Unicode. You cannot. You decode a byte string from a source character encoding to Unicode data: code-point sequences.

Repeat after me:

  • when transforming data from Unicode to a specific character set, you are encoding Unicode codepoints into bytes, according to a specific character encoding that suits your target character set.
  • when transforming from (byte) strings to Unicode, you are decoding your data: if you fail to provide the right character encoding to decode from, you will end up with borked Unicode data.
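The two directions above, sketched in Python 3 (where str holds Unicode data and bytes holds byte strings). The choice of cp1252 as the "wrong" character encoding is only an assumption for the demo; it happens to reproduce the “Ãœ” mentioned earlier.

```python
text = "Ü"                          # Unicode data: one code point, U+00DC

# Encoding: Unicode code points -> bytes, using a specific character encoding.
encoded = text.encode("utf-8")
assert encoded == b"\xc3\x9c"

# Decoding: bytes -> Unicode code points, using the SAME character encoding.
decoded = encoded.decode("utf-8")
assert decoded == text

# Decoding with the wrong character encoding yields borked Unicode data.
borked = encoded.decode("cp1252")
assert borked == "Ãœ"
```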

It’s not “okay” to let beginners work with approximate knowledge of the Unicode buzzword. They will eventually get confused, and you will end up losing time re-explaining over and over the same things. Approximate formulations reflect approximate knowledge, and you should not let that be. Approximate Unicode knowledge is the blocker, the main reason why everything we do is not (yet) in Unicode.

Because of this kind of approximation, Unicode support in Python was broken until Python 3.0: before that, Unicode data and byte strings derived from a common class. Because of this kind of approximation, we have hundreds of beginners not understanding the difference between UTF-8 and Unicode, and not understanding why string.encode('utf-8') can throw an error: you see, you just said that it was okay to “Unicode-encode”, and that UTF-8 is Unicode, so basically they are trying to “encode” strings as… Unicode, and the fact that it fails just puzzles them, because Unicode was supposed to be the magical remedy to all their encoding errors.
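Why that call can fail, sketched in Python 3. The helper py2_style_encode is hypothetical: it only mimics what Python 2 did when encode() was called on a byte string, namely an implicit decode with the default encoding (assumed here to be ASCII, the Python 2 default) followed by the requested encode.

```python
raw = "é".encode("utf-8")           # a byte string: b"\xc3\xa9"


def py2_style_encode(byte_string, encoding):
    """Mimic Python 2's str.encode(): hidden ASCII decode, then the real encode."""
    return byte_string.decode("ascii").encode(encoding)


error = None
try:
    py2_style_encode(raw, "utf-8")
except UnicodeDecodeError as exc:
    # The hidden decode step chokes on the first non-ASCII byte,
    # so "encoding" appears to raise a *decoding* error.
    error = exc

# Python 3 removes the trap: byte strings simply have no encode() method.
bytes_can_encode = hasattr(raw, "encode")
```

In Python 3 the confusion cannot even be written down: only Unicode data can be encoded, and only byte strings can be decoded.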

Because of these approximations, the .NET property Encoding.Unicode is the shortcut for what should be Encoding.UTF16. There are Encoding.UTF8, Encoding.UTF32, and Encoding.ASCII, and in the middle of those… Encoding.Unicode. How can developers write such wrong things? Unicode is not an Encoding, UTF16 is not Unicode. Just look at the wonderful C# construct Encoding u16LE = Encoding.Unicode; taken directly from the documentation: congratulations, you are assigning a “Unicode” value to an “Encoding” type. Crystal clear.

A good image, perhaps, to explain the fundamental type difference between Unicode and, let’s say, UTF-16, would be to think of Unicode as an interface, and of UTF-16 as a concrete class implementing that interface.

Unicode does not define any implementation: it defines no data representation, only an international, unequivocal way to associate a character with a code-point. You could store a list of code-points to represent Unicode data, yes, but doing this forces you to store 4 bytes per character because of the large code-point range. This is rather inefficient, and this is why UTF-* appeared. The whole idea is to map Unicode data to byte strings, choosing the mapping function yourself, so that the resulting representation fits your needs. In a way, you have many different strategies implementing the same interface, depending on your focus:

  • UTF-8 focuses on minimizing the byte size of characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:

    “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

  • UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping each Unicode code-point directly to 4 bytes. Obviously, it’s not very size-efficient.
  • UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
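The three strategies above can be compared with a few lines of Python 3. This is only a sketch: encoded_sizes and the sample characters are picked for illustration, and the little-endian codec variants are used so that no byte-order mark inflates the counts.

```python
samples = {
    "ASCII letter": "A",               # U+0041
    "accented letter": "é",            # U+00E9, inside the BMP
    "outside the BMP": "\U0001F600",   # U+1F600
}


def encoded_sizes(char):
    """Bytes needed for one code point in each encoding (no byte-order mark)."""
    return {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for label, char in samples.items():
        print(f"{label:>16} U+{ord(char):04X}: {encoded_sizes(char)}")
```

Same Unicode data every time, three different byte representations: the code points are the interface, the encodings are the implementations.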

I’m not saying that every developer should know these details, I’m just saying that you should know about that very basic interface/implementation difference: Unicode is just an intermediate representation, not an actual encoding. I find it awful that .NET is so misleading here. Also, UTF-(8|16|32) are the most famous “implementations”, but there are several other standards, such as UTF-EBCDIC.

If you’re an experienced programmer, please do not allow those approximations. If a code “guru” writes those sorts of shortcuts, how will attitudes toward Unicode ever change? Unicode is not black magic, you just have to be precise, and people will understand. Please use the right words. Over-simplifying concepts is not helping. Thanks.

Mixing tabs and spaces in Python
March 24, 2009, 7:00 am
Filed under: Uncategorized

You’re doing it wrong.

Actually, even if you generalize the problem to Mixing tabs and spaces in code, you’re still doing it wrong. When coding a program, a file in a project, something, you should choose an indentation style, n spaces or tabs, and stick to it.

The issue here is editor-dependent. Suppose that a coder A creates a program where tabs and spaces are mixed. Because A is not doing everything wrong, he tries to keep his code properly indented, and indents every line within the same code block identically. But the way a tabulation looks depends on A’s text editor: if, let’s say, his editor renders a tab as 4 spaces, then he will insert multiples of 4 spaces here and there in his code, fulfilling the same indenting role as a tabulation.

And here comes coder B, who uses an editor which renders tabs as 8 spaces… The code will look broken, you get the picture.

But of course in Python, where indentation is a language requirement, it’s even worse. The code not only looks broken, odds are that it is broken. Better, if you’re very lucky, the syntax will still be correct, never raising an IndentationError at any point in the code, but your control flow will be silently broken: functions returning too early, loops breaking too early, etc. Stealthy borked code. Sounds nice, right?
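A small sketch of the problem, runnable on Python 3. The snippet in SOURCE and the helper names are made up for illustration. It relies on two facts: Python 2 expanded tabs to 8-column tab stops when reading indentation, and Python 3 refuses ambiguous tab/space mixes with a TabError instead of guessing.

```python
# Three lines indented with spaces, then one line indented with a single tab.
SOURCE = (
    "for item in items:\n"
    "    if item.ok:\n"
    "        handle(item)\n"
    "\tcount += 1\n"
)


def indent_width(line, tabsize):
    """Visual indentation of a line when tabs expand to `tabsize` columns."""
    expanded = line.expandtabs(tabsize)
    return len(expanded) - len(expanded.lstrip(" "))


def compiles_in_python3(source):
    """True if Python 3 accepts the source, False if it raises TabError."""
    try:
        compile(source, "<demo>", "exec")
    except TabError:
        return False
    return True


last_line = SOURCE.splitlines()[-1]
as_seen_by_coder_a = indent_width(last_line, 4)   # lines up with the `if`: loop body
as_read_with_8_col = indent_width(last_line, 8)   # lines up with handle(): `if` body
```

Coder A sees count += 1 running on every iteration; an 8-column reading puts it inside the if. Same bytes, two control flows.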

And today, looking at the latest code commits, you notice that yet again, while you were sleeping, some contributor inserted tabs in your project, which uses a 4-space indentation convention. Damn hippie.

Well, it’s okay, you got up early today, and you’re feeling smart, as usual. You’re going to fix this, to teach that dude a lesson. You quickly hack a small bash one-liner that will replace tabs with 4 spaces:

for file in $(grep -Rl [[:cntrl:]] . | grep "\.py$"); do sed -i 's/^\t/    /g' $file; done

You run it in your project’s base directory, glance at the changes (it works, I’m too good!), commit and push them, and get back to work. Perfect, right?

Well, days later, when tens of commits piled up, people come to you complaining that this and that feature stopped working. You look at the commits, you see nothing wrong. Naturally, you assume that those punks got the command line arguments wrong, and tell them to try again.

But nothing does it. There’s a pretty major regression. After a few minutes you understand that this is the well-known Stealthy code phenomenon, and take a closer look at the indentation fixes you made a few days ago. And of course you did it wrong. Assuming that a tab should be replaced by 4 spaces because your project uses a 4-space convention was a completely wrong supposition. PUNK!

And now you’re stuck doing painful merges to try to include the recent commits while correctly fixing the indentation issues. If you learned your lesson well (but chances are that you did not), you might be considering adding a python -tt à la CruiseControl dictatorial test to automatically reject any incoming patch containing tabs.
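One way such a dictatorial test could look, as a rough sketch only. The function name and the idea of walking the whole tree are assumptions of mine, and this check is stricter than python -tt: it flags any tab found in the leading whitespace of a .py file, whether or not the mix is inconsistent.

```python
import os


def files_with_tab_indentation(root):
    """Return sorted paths of .py files under `root` whose indentation contains a tab."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                for line in handle:
                    # Leading whitespace = the part stripped off by lstrip().
                    indent = line[: len(line) - len(line.lstrip(b" \t"))]
                    if b"\t" in indent:
                        offenders.append(path)
                        break
    return sorted(offenders)


if __name__ == "__main__":
    import sys

    bad = files_with_tab_indentation(".")
    for path in bad:
        print(f"tab in indentation: {path}")
    sys.exit(1 if bad else 0)
```

Run from the project root in the build, a non-zero exit status is enough to bounce the patch before it lands.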

March 19, 2009, 6:54 am
Filed under: Uncategorized

ProjectXXX/yyy.c:nn:cc: error: Python.h: No such file or directory

Don’t even start. The source you’re compiling is probably just fine. You’re doing it wrong.

You probably have Python installed, but you are missing the development package and its corresponding headers. Look for python-dev or something, but don’t even think about complaining to ProjectXXX’s devs.
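A quick way to check before blaming anyone, sketched with the standard sysconfig module. The helper name python_header_path is made up, and the script assumes you run it with the same interpreter the build is targeting.

```python
import os
import sysconfig


def python_header_path():
    """Path where this interpreter expects Python.h to live."""
    return os.path.join(sysconfig.get_paths()["include"], "Python.h")


if __name__ == "__main__":
    path = python_header_path()
    status = "found" if os.path.isfile(path) else "missing, install the development package"
    print(f"{path}: {status}")
```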

Well, if you’re reading this…
March 18, 2009, 7:24 am
Filed under: Uncategorized

You’re probably doing it wrong. Sorry.

Header from dommend, cc-by-nc-sa.