Character sets and content encoding hell

This is a sister article to Dealing with Character Sets and Encodings in MySQL, which was a bit more theoretical. In this instalment we'll deal with a particular type of fun, which arises when content is double-encoded, or encoded ambiguously.

Character encodings are fureally really hard. Mysql regularly adds to those joys because it also has switches and dials for encodings.

As an example, this file might have been transcoded too many times.

[koyomi@ararararagi ~]$ hexdump -C geneva 
00000000  67 65 6e 65 76 61 3e c3  a2 e2 82 ac c2 a2 c3 82  |geneva>â�.�¢�.|
00000010  c2 a0 20 57 69 64 65 0a                           |  Wide.|
00000018

Encoding theory

A good understanding of encoding mechanics is most helpful when diagnosing these sorts of issues. There's far too much to cover here, but wikipedia's article on UTF-8 is a good thing to read, as UTF-8 is very widely used, and universally applicable.

The terms we're dealing in are:

  • Characters, as they're usually called, are defined in the unicode standard as "glyphs" or "codepoints"
  • Glyph is a semantic term referring to the visual appearance of the character, the codepoint is a unique number assigned to that character
  • The Unicode standard has room for about 4 billion codepoints - that's a lot
  • We need a way to encode them so that they can be transmitted in other protocols, like mail, http, etc
  • UTF-8 is one such encoding
  • UTF-8 isn't magic. All it's doing is encoding a number (codepoint) into a sequence of bytes that can be easily stored, transmitted and read

This last point, and the specifics of how it occurs, is the most important to understand, because it's what we'll be dealing with in this article.

Handy UTF-8 byte table

http://www.fileformat.info/info/unicode/utf8.htm

Need a quick check? UTF-8 just encodes Unicode bytes to a sequence of single bytes.

Literal bytes

Encoded range

Note

0x0000 - 0x007f

00       to 7f

1 byte

80       to c2-7f

INVALID RANGE, 80 to c1 leading byte is a dead giveaway

0x0080 - 0x07ff

c2-80    to df-bf

2 byte

df-c0    to e0-a0-7f

INVALID RANGE

0x0800 - 0xffff

e0-a0-80 to ef-bf-bf

3 byte

This is also a good reference: http://www.utf8-chartable.de/

Apostrophes suck anyway

Here's an example from a customer enquiry, the apostrophes in the DB are coming out wrong.

[mtepes@bund ~]$ echo "SELECT Heading FROM CaseStudyItem WHERE ID=2;" | mysql | hexdump -C
<snip>
65 20 6f 66 20 41 75 73  74 72 61 6c 69 61 c3 a2  |e of Australia..|
e2 82 ac e2 84 a2 73 20  6d 6f 73 74 20 63 68 61  |......s most cha|

That's "c3 a2 e2 82 ac e2 84 a2" for the apostrophe. Looking above, we can see this is three UTF8 "characters".

c3 a2    == encoded form of e2 (LATIN SMALL LETTER A WITH CIRCUMFLEX)
e2 82 ac == encoded form of 20 ac (EURO SIGN)
e2 84 a2 == encoded form of 21 22 (TRADEMARK SIGN)

Which is what they're displaying as on the page. Astute readers will note that e2 could be decoded again, as e2 would signal the start of a three-byte UTF8 character.

e2 20 ac == encoded form of err... hang on
21 22    == regular ascii for a bang and a quote (  !"  )

e2 20 ac is wrong, that's an invalid sequence. What's happening here, perhaps it's WINDOWS-1252? That's another common problem you get, so it's worth looking out for. It happens especially when people copy-paste content from editors like Word into a website CMS.

This is when you start pumping the output through tools like iconv to see if you can come up with any workable encodings.

In this particular case, it took a bit of a leap of insight (ignore the 0x0a, that's a line-ending).

kugyuu@orz:/tmp% cat case.txt| hexdump -C
00000000  c3 a2 e2 82 ac e2 84 a2  0a                       |.........|

ikamusume@shinryaku:/tmp% cat case.txt | iconv -f utf8 -t windows-1252 | hexdump -C
00000000  e2 80 99 0a                                       |....|

e2 80 99 is a single right-quote, which is probably what it should've been originally.

So what happened here?

  1. Someone wanted a single right-quote
  2. That was stored correctly in UTF-8, as e2 80 99

  3. Something read that, assuming it was encoded in Windows-1252
  4. It knew it wanted UTF-8 output, so it blindly encoded the assumed-to-be-windows1252 byte sequence into UTF-8
  5. That gave us c3 a2 e2 82 ac e2 84 a2

  6. c3 a2 e2 82 ac e2 84 a2 is valid UTF-8 for three other characters, which the browser displays: LATIN SMALL LETTER A WITH CIRCUMFLEX, EURO SIGN, TRADEMARK SIGN