Confirmation Bias of Sorts

August 8, 2008

I'm not sure what the concept is called, but basically I'm talking about the fact that when you shovel something new into your brain you start seeing it more often because it stands out now. I never noticed how many Hondas (and now VWs) were on the road until I bought my own. Now whenever I see a black VW with a satellite antenna I check the license plate to make sure my car hasn't been stolen. (So far, so good.)

And now that I've had to dip my toe into the world of character encoding I see that more frequently too. Well, actually, I see it less frequently but it stands out more because when I do see it I know why it's there.

Ever see a blog entry, or even an online article, with this mishmash in it: â€” ? That's a long dash (—) with two errors in it. First off, it got sent into the content management system as UTF-8 but was interpreted as three separate single-byte characters. Then when it got spit out, the final character was interpreted as a curly quote instead of a non-printing character because the browser assumes you're looking at a page made on Windows.
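Here's a minimal Python sketch of that round trip. It assumes the single-byte character set involved is Windows-1252 (Python's `cp1252` codec), with plain Latin-1 shown for comparison; the variable names are just for illustration.

```python
# Encode a long dash (U+2014) as UTF-8, then misread those bytes one at a time.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")

# Assumption: the page gets treated as Windows-1252, so every byte
# becomes a visible character of its own.
mojibake = utf8_bytes.decode("cp1252")

# Strict Latin-1 instead maps 0x80 and 0x94 to non-printing control characters.
latin1 = utf8_bytes.decode("latin-1")

print(" ".join(f"{b:02X}" for b in utf8_bytes))   # E2 80 94
print([f"U+{ord(ch):04X}" for ch in mojibake])    # ['U+00E2', 'U+20AC', 'U+201D']
print([f"U+{ord(ch):04X}" for ch in latin1])      # ['U+00E2', 'U+0080', 'U+0094']
```

The last two lines show the difference between the two readings: the same three bytes, but only the Windows-1252 reading ends in a curly quote.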

â is character 0xE2
€ is 0x80
” is really 0x201D but in the Windows character set it's 0x94

Convert those into binary and you get 0b11100010.10000000.10010100 -- a UTF-8 three-byte character. Strip out the markers and you're left with xxxx0010.xx000000.xx010100 which reorganizes into 0b00100000.00010100. In hex that's 0x2014, in decimal it's 8212, the code for a long dash.
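The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the three-byte layout (1110xxxx 10xxxxxx 10xxxxxx), not a general decoder, and the names `b0`, `b1`, `b2` are made up for the example.

```python
# Decode one three-byte UTF-8 sequence by hand.
b0, b1, b2 = 0xE2, 0x80, 0x94

# Check the marker bits before stripping them.
assert b0 >> 4 == 0b1110                      # lead byte of a three-byte sequence
assert b1 >> 6 == 0b10 and b2 >> 6 == 0b10    # two continuation bytes

# Keep the 4 + 6 + 6 payload bits and glue them into one 16-bit value.
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(f"{code_point:016b}")   # 0010000000010100
print(hex(code_point))        # 0x2014
print(code_point)             # 8212

# Cross-check against Python's own UTF-8 decoder.
print(bytes([b0, b1, b2]).decode("utf-8") == chr(code_point))   # True
```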

Curly quotes, when typed correctly, can get misinterpreted as well; they show up as â€� (the question mark is part of the wrongly-encoded character and may show up as an empty box instead).
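A rough Python sketch of why the last character comes out as a question mark or a box, again assuming Windows-1252 is the character set doing the misreading. In Python's `cp1252` codec the third byte of a closing curly quote, 0x9D, has no character assigned; what a particular browser does with that byte may differ.

```python
# A closing curly quote (U+201D) is E2 80 9D in UTF-8.
quote_bytes = "\u201d".encode("utf-8")

# Byte 0x9D is unassigned in Python's cp1252 codec, so a strict decode fails.
try:
    quote_bytes.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace" the unassigned byte turns into U+FFFD,
# the replacement character (often drawn as a question mark or a box).
lenient = quote_bytes.decode("cp1252", errors="replace")

print(strict_ok)                                 # False
print([f"U+{ord(ch):04X}" for ch in lenient])    # ['U+00E2', 'U+20AC', 'U+FFFD']
```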

I'm just moderately amused that one character that itself gets done wrong from time to time manages to show up in another character done wrong thanks to using the wrong character set. I wonder if a character ever shows up in its own UTF-8 encoding. Could set up a neat little infinite loop there...
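That last question can be poked at with a brute-force search. The sketch below makes a few assumptions: the wrong character set is Windows-1252, only non-ASCII characters up to U+FFFF are checked, and `shows_up_in_own_mojibake` is a made-up helper name.

```python
# Which characters reappear when their own UTF-8 bytes are misread as cp1252?
def shows_up_in_own_mojibake(ch: str) -> bool:
    # errors="replace" covers the few bytes cp1252 leaves unassigned.
    garbled = ch.encode("utf-8").decode("cp1252", errors="replace")
    return ch in garbled

found = [
    chr(cp)
    for cp in range(0x80, 0x10000)
    if not 0xD800 <= cp <= 0xDFFF              # surrogates can't be encoded
    and shows_up_in_own_mojibake(chr(cp))
]

print([f"U+{ord(ch):04X}" for ch in found])
print(repr("\u00a9".encode("utf-8").decode("cp1252")))   # 'Â©'
```

Under those assumptions the list includes the copyright sign (which garbles into Â©) and Ã (which garbles into Ãƒ), so running the wrong conversion over and over would make the string grow each time rather than loop in place.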
