More Than I Ever Wanted to Know About Character Encoding

July 21, 2008

This was on my Facebook news feed for a few days, so you can guess I spent a lot of time at work on it. You can find most of this information on Wikipedia, but I'm collecting it here, mostly for my own benefit later on.

Unicode

First off, let's cover what Unicode is and isn't. Unicode is basically a list of "code points" that represent concepts like letters, Chinese characters, Japanese syllables, and stuff that has nothing to do with language, like musical notes.

So I can say that a capital letter A is code point 65, or 0x41 in hexadecimal. 0x3041 is the hiragana syllable "ぁ", 0x2E9D is the CJK (Chinese, Japanese, and Korean) radical for "moon." As a non-language example, 0x01D11E is a G Clef. Lots of stuff to worry about.
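
If you have Python 3 handy you can poke at these numbers yourself; ord() and chr() just convert between a character and its code point:

print(hex(ord("A")))     # 0x41
print(hex(ord("ぁ")))    # 0x3041
print(chr(0x1D11E))      # the G Clef, if your font can draw it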

But that's it, really. Unicode says nothing about how you get those code points from Point A to Point B. To do that you need to encode the numbers, and that's where a couple days' worth of time sink came into play.

Character Encoding

Well, great. We have these code points but now we have to send them over the wire. In theory we can just do a straight dump of a bunch of binary data, but we run into a problem. Look back at that Japanese character: 0x3041. It's going to come in as two bytes: 0x30 0x41. Is that hiragana, or a zero followed by a capital A? Maybe we can figure it out from the language but even that's no guarantee.
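
You can watch the ambiguity happen for yourself; here's a quick Python check showing the same two bytes read both ways (utf-16-be is Python's name for one of the encodings covered below):

print(b"\x30\x41".decode("utf-16-be"))   # ぁ
print(b"\x30\x41".decode("ascii"))       # 0A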

We need to work a little magic on these code points to make them unambiguous.

UTF-16

Most operating systems these days use UTF-16 to represent Unicode characters. Just take the code points and send them as 16-bit words. For everything other than English this works just fine; most of the rest of the world has been using double-byte characters for years anyway. There's no way to mistake the hiragana "ぁ" for a "0A" combo, because the Japanese is sent as 0x3041, while the others would be sent as 0x0030 0x0041.

That also demonstrates the one real drawback: if you're doing a lot of stuff in English you "waste" bytes by sending 7-bit ASCII as 16 bits. You also have to worry about endianness, which tells you which byte within a word is sent first -- when that hiragana character is broken up for transfer, both 0x30 0x41 and 0x41 0x30 are legitimate; you have to figure out which order you're getting your bytes in. In theory your receiving program will take care of this for you. In theory.
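
Here's a quick look at both byte orders in Python; utf-16-be and utf-16-le are just Python's codec names for the two options:

print("0Aぁ".encode("utf-16-be").hex())   # 003000413041
print("0Aぁ".encode("utf-16-le").hex())   # 300041004130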

It's also a minor pain in the butt if you leave the Basic Multilingual Plane (BMP), the range between 0x0000 and 0xFFFF. Our G clef would get sent as a pair of 16-bit words. If you're in the 0x10000 to 0x10FFFF range (there's a quick Python check of the same math after the list):

  1. Subtract 0x10000 from your character. This will drop you to something between 0x000000 and 0x0FFFFF.
    For the G Clef this gives 0x00D11E.
  2. Take the right 20 bits (0x00000 to 0xFFFFF) and split them into two groups of 10.
    For the G Clef you take your bits (0000.11010001.00011110) and split them up into (00.00110100) and (01.00011110).
  3. To your first set of bits, add 0xD800.
    (11011000.00000000 | 00000000.00110100 = 11011000.00110100 = 0xD834)
    To the second set of bits, add 0xDC00.
    (11011100.00000000 | 00000001.00011110 = 11011101.00011110 = 0xDD1E)
    Unicode has blocked out code points 0xD800 through 0xDFFF so there will never be collisions.
  4. Now you just have to decide on endianness and send.

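That promised Python check: the same steps written out and compared against Python's own UTF-16 encoder (surrogate_pair is just my name for it, nothing official):

def surrogate_pair(cp):
    cp = cp - 0x10000             # step 1: drop into the 20-bit range
    high = 0xD800 + (cp >> 10)    # steps 2-3: top 10 bits onto 0xD800
    low = 0xDC00 + (cp & 0x3FF)   #            bottom 10 bits onto 0xDC00
    return high, low

print([hex(n) for n in surrogate_pair(0x1D11E)])   # ['0xd834', '0xdd1e']
print("\U0001D11E".encode("utf-16-be").hex())      # d834dd1e
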
Seems kinda complicated. But we're only getting started.

UTF-8

UTF-8 kills two birds with one stone: It lets you use single bytes for ASCII, and it gets around byte order problems by sending each byte individually in order.

Since we're dealing with individual bytes we need to tell the receiving system how many we plan on using for a given character. We do this by setting the high bits of the first byte: for a multi-byte character, the number of leading 1 bits is the total number of bytes, so 110xxxxx means two bytes, 1110xxxx means three, and 11110xxx means four.

Remaining bytes in a multi-byte character are formatted (10xxxxxx) -- since single-byte characters start with 0, there can never be collisions.

Since we start with a 0 for single bytes, that means we can only send up to 0x7F as normal ASCII. Everything else gets encoded the same way, regardless of the number of bytes:

  1. Break your character number up into groups of 6 bits starting on the right:
    ぁ = 0x3041 = (0011)(000001)(000001)
    G Clef = 0x1D11E = (00)(011101)(000100)(011110)
  2. To each byte group but the first, attach (10000000):
    ぁ = (0011)(10000001)(10000001)
    G Clef = (00)(10011101)(10000100)(10011110)
  3. To each first byte, set one bit for each byte you're sending, then a 0, then zero-pad what's left of your character to fill out the byte:
    ぁ = (11100011)(10000001)(10000001)
    G Clef = (11110000)(10011101)(10000100)(10011110)

That's all, folks. OK, I over-simplified the process a little (with the right character you can get "bumped" up into an extra byte -- 0x0800, for example, splits into just two 6-bit groups, but the leading group won't fit in the five payload bits of a two-byte first byte, so it takes three bytes) but that's the basic idea.

Our final characters in UTF-8, then, are:
A = 0x41
ぁ = 0xE3 0x81 0x81
G Clef = 0xF0 0x9D 0x84 0x9E
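
If you'd rather see that as code, here's a short Python sketch of the same bit-twiddling; Python's built-in str.encode("utf-8") gives the same bytes, this is just to show the steps:

def utf8_bytes(cp):
    if cp < 0x80:
        return bytes([cp])              # plain ASCII: 0xxxxxxx
    elif cp < 0x800:
        lead, count = 0xC0, 2           # 110xxxxx plus one trailing byte
    elif cp < 0x10000:
        lead, count = 0xE0, 3           # 1110xxxx plus two trailing bytes
    else:
        lead, count = 0xF0, 4           # 11110xxx plus three trailing bytes
    out = []
    for _ in range(count - 1):
        out.append(0x80 | (cp & 0x3F))  # trailing bytes carry 6 bits each: 10xxxxxx
        cp >>= 6
    out.append(lead | cp)               # whatever is left rides in the first byte
    return bytes(reversed(out))

print(utf8_bytes(0x41).hex())      # 41
print(utf8_bytes(0x3041).hex())    # e38181
print(utf8_bytes(0x1D11E).hex())   # f09d849e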

Downside: If you're doing a lot with Asian languages you wind up using 3 bytes instead of 2, increasing file size by 50%.
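
For example, three kanji cost 9 bytes in UTF-8 but only 6 in UTF-16:

print(len("日本語".encode("utf-8")))       # 9
print(len("日本語".encode("utf-16-be")))   # 6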

But What About E-mail, Smart Guy?

Oh, you didn't think we were done, did you? Now we have to e-mail that crapola. The body of the e-mail is fine, because we can just use HTML entities like &#12353; for the ぁ. The subject is a different matter, though, since headers aren't 8-bit clean. You have to work one more layer of voodoo on your UTF-8 if you have anything represented as multi-byte.

OK, just like before, we're gonna take our (now-converted) bytes and chunk 'em into groups of 6 bits, but now we're gonna start at the beginning instead of the end.

ぁ = 0xE3 0x81 0x81 = (111000)(111000)(000110)(000001)
G Clef = 0xF0 0x9D 0x84 0x9E = (111100)(001001)(110110)(000100)(100111)(10----)

We now have a bunch of groups that can store anything between 0 and 63. If only we had some way of representing this as a set of columns, like base-10 is 10^0 (ones place), then 10^1 (tens place), then 10^2 (hundreds place) and so on. It'd be like... base-64!

And that's what Base64 does. 0-25 are capital letters, 26-51 are lower case letters, 52-61 are 0-9, 62 is + and 63 is /. Base64 conversions have to come in groups of 4 characters, so if you have any gaps, like in the G Clef conversion, you pad it out with equals signs, so the decoding computer knows the last chunk came up short rather than assuming a few extra NULL bytes.
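
In Python that alphabet is just a 64-character string you index into (call it B64); feeding it the four groups we worked out for ぁ gives the same answer as below:

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
print("".join(B64[n] for n in (56, 56, 6, 1)))   # 44GB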

So now we have:
ぁ = (111000)(111000)(000110)(000001) = (56)(56)(6)(1) = 44GB
G Clef = (111100)(001001)(110110)(000100)(100111)(10----) = (60)(9)(54)(4)(39)(32) = 8J2EHg==
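
Python's base64 module will happily confirm both of those:

import base64
print(base64.b64encode("ぁ".encode("utf-8")))            # b'44GB'
print(base64.b64encode("\U0001D11E".encode("utf-8")))    # b'8J2EHg=='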

OK, we've done the Base64 conversion. Now what? How does the receiving mail program know I didn't have a seizure while I was typing?

By a magic number. Magical character sequence, in this case: =?UTF-8?B?44GB8J2EHg==?=

That complicated mess gives the character encoding (UTF-8) and tells how it's encoded itself (B = Base64). The =? and ?= are markers to let us know when it starts and ends; you can have other stuff outside the escaped stuff. So if I wanted to send an e-mail with the subject of Aぁ it might come through as A=?UTF-8?B?44GB?=. It might also get encoded with the A inside the Base64 stuff, which would look different: =?UTF-8?B?QeOBgQ==?=
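
If you want to build one of these headers yourself (the mail RFCs, RFC 2047 in particular, call this an "encoded-word"), it's just string glue around the Base64 payload. Here's a little Python sketch; encoded_word is my name for it, not anything standard:

import base64

def encoded_word(text):
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "=?UTF-8?B?" + payload + "?="

print(encoded_word("ぁ\U0001D11E"))   # =?UTF-8?B?44GB8J2EHg==?=
print(encoded_word("Aぁ"))            # =?UTF-8?B?QeOBgQ==?=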

So What Was the Big Deal?

Well, I haven't had a chance to learn ASP.NET yet, so all my web sites are still written in ASP 3.0, which first came out about 10 years ago. ASP 3.0 is based on VBScript, which is based on VB6, which means it does 16-bit integers. That can cause problems if you're dealing with code points above 32,000 or so, since negative numbers don't map to anything.

I already had code in place to handle stuff in the 0x7F to 0xFF range, so I could convert from Windows-1252 characters to Unicode. Since my sites are all served as either Windows-1252 or Latin-1, any other characters got converted by the browser when the user hits send.

That meant I had a bunch of text, with HTML entities in it. So I hacked together something quickly to parse it out. This bypasses the whole Int16 problem by just doing UTF-8. The part where I converted the UTF-8 to Base64 is left as an exercise for the reader.

Please note I'm not saying this is optimal. I'm just saying it works. As an FYI, the "and" and "or" keywords in VBScript are bitwise, but there are no shift operators, so everything else, like bit-shifting, has to be done mathematically.

' Walk the text and turn numeric HTML entities (&#NNNNN;) into raw UTF-8 bytes.
for i = 1 to Len(Text) ' string position is 1-indexed
    if Mid(Text, i, 2) = "&#" then
        ' collect the digits after the "&#"
        TmpString = ""
        j = i + 2
        while IsNumeric(Mid(Text, j, 1))
            TmpString = TmpString & Mid(Text, j, 1)
            j = j + 1
        wend
        TmpString = CLng(TmpString) ' now it's the code point as a number
        if TmpString < 2^7 then
            ' U+000000 to U+00007F (ASCII): single byte, unchanged
            RetString = RetString & Chr(TmpString)
        elseif TmpString < 2^11 then
            ' U+000080 to U+0007FF (Unicode, includes Arabic and non-ASCII European)
            RetString = RetString & Chr(&hC0 + ((TmpString and &h07C0) / 2^6)) & _
                        Chr(&h80 + (TmpString and &h003F))
        elseif TmpString < 2^16 then
            ' U+000800 to U+00FFFF (Unicode, includes Chinese and Japanese)
            RetString = RetString & Chr(&hE0 + ((TmpString and &hF000) / 2^12)) & _
                        Chr(&h80 + ((TmpString and &h0FC0) / 2^6)) & _
                        Chr(&h80 + (TmpString and &h003F))
        elseif TmpString >= 2^16 then
            ' doesn't handle U+010000 to U+10FFFF, but doesn't really need to.
        end if
        i = j ' jump past the digits (and the trailing ";")
    else
        RetString = RetString & Mid(Text, i, 1)
    end if
next
Text = RetString

Now wasn't that fun?
