Dealing with Character Sets and Encodings in MySQL

Character sets

MySQL has support for a large range of character sets. At least for us, the most we'll deal with is CJK (Chinese, Japanese and Korean) on our shared hosting and dedicated servers, though this is equally applicable to other situations. There's a fair bit of official documentation on this. Unfortunately, character sets and encoding is always a bit of a minefield due to the way computing has developed over time.

Nowadays, unicode is the preferred character set, as it defines and allows the representation of pretty much every "glyph" (commonly known as a "character") in existence.

Encodings

Encoding refers to the way characters are stored in memory (RAM, HDD, on the wire, etc.). Specifically, how a character is translated into a sequence of bits. an encoding defines a method to turn a character into a series of bits that can be interpreted elsewhere. UTF-8 is a popular encoding for the Unicode character set. Don't conflate the two! UTF-8 uses a variable number of bytes to represent each character. Simple characters, including all the ASCII character set, only need one byte. Most CJK characters need a couple.

Unicode is hard

Why is Unicode (and its associated encodings) hard? Because it needs support at Every Step Of The Way. If we consider putting unicode into a mysql database for a webapp to display:

  1. You need to be able to input foreign characters (your Input Method Editor, or IME)
  2. Your terminal needs to accept unicode
  3. Your editor needs to handle unicode and write it to a file
  4. Your mysql client needs to accept/expect unicode bytes
  5. The mysql server needs to accept unicode on the wire
  6. The database/table has to be set to store unicode strings
  7. The database client connector (PHP? Java?) needs to work with unicode
  8. Your language needs support for handling unicode (particularly poor in PHP)
  9. You need to serve the data via HTTP with the correct charset header
  10. The browser needs to support unicode (almost certain nowadays)
  11. The OS needs fonts to display the international characters

Scenario

What's the problem?

We have a shared hosting customer's .sql file with a dump of their database. They need it imported into their MySQL database, but characters aren't displaying correctly on the webpage once they do. What can we do?

Something else you might see

A similar problem you might run into is the insertion of HTML entities in your database. They'll display okay on a webpage, but the characters stored in the database aren't what you think they are:

mysql> SELECT * FROM `users`;
+----+----------------------------------+----------+---------------------+
| id | name                             | password | email               |
+----+----------------------------------+----------+---------------------+
|  3 | Sakura                           | tomoyo   | sakura@gmail.com    |
|  0 | 眼鏡っ娘 | chen     | chen@gmail.com      |
+----+----------------------------------+----------+---------------------+
2 rows in set (0.00 sec)

What's going on here?? That second line is meant to say 眼鏡っ娘. Chances are the data was submitted via a form, maybe to a PHP script, and the characters got converted to the HTML entity representation somewhere along the way. If that's happened, this guide won't be of any use. You need to convert those entities back into the real characters and update the database.

What format?

I cat'd the file and saw gribble where there were foreign strings.

$ cat foo.sql 
-- �D��: localhost
-- �إߤ���: Sep 10, 2008, 03:11 PM
-- ���A������: 5.0.27
-- PHP ����: 5.2.3

Okay, this isn't in UTF8 (my terminal is UTF8 by default). Some judicious use of iconv helped me figure out what encoding the file was using. I already had hints that it was a Chinese file.

$ iconv -f gbk -t utf8 foo.sql 
-- iconv: illegal input sequence at position 3

$ iconv -f gb2312 -t utf8 foo.sql 
-- iconv: illegal input sequence at position 3

$ iconv -f big5 -t utf8 foo.sql 
-- 主機: localhost
-- 建立日期: Sep 10, 2008, 03:11 PM
-- 伺服器版本: 5.0.27
-- PHP 版本: 5.2.3

Aha, BIG5. BIG5 is a common encoding for Traditional Chinese characters. (if you can't see the Chinese characters here, your system isn't properly setup to read them; you might be missing fonts, or your browser isn't detecting the page correctly)

Text editor

Most of us here use vim. vim has unicode/utf8 support, and has had it for a while. Of course it has support for other character sets, so long as you tell it ahead of time what the expected encoding is and how to display it. This can be done in your ~/.vimrc

set encoding=big5
set termencoding=utf8
  • encoding is the expected input encoding for a file

  • termencoding tells vim what to display on the terminal

vim can automatically detect the correct encoding through the use of the fileencoding variable, which tells it what to attempt and in what order. It's empty by default on this shared hosting server, whereas it's ucs-bom,utf-8,default,latin1 on my Ubuntu workstation. vim runs through the list from left to right, stopping on the first one that "succeeds". UTF types can fail due to strict encoding rules. latin1 (and some others) will always "succeed" even if the output will be garbage (so it should go last in the list).

After adding the encoding and termencoding settings to my ~/.vimrc, I can open the sql file and all the foreign characters are displayed correctly now.

Terminal

Your terminal has to be equipped to deal with UTF8. I use GNOME terminal, and it's UTF8 by default. You can choose all the other encodings to meet your needs.

MySQL

Configuration

MySQL lets you set the encoding for the client, server, database, connection and results (and a couple other things). You can set BIG5 for a table charset, but everything else would have to work alongside this. SET NAMES can be used to set most of these appropriately.

SET NAMES 'big5';
SET NAMES 'utf8';

With a bit of testing, I can get BIG5 on my clipboard by setting the terminal to BIG5, setting vim to have termencoding=big5, then selecting some Chinese characters. Start the mysql command-line client, then SET NAMES 'big5';. I can now do INSERT INTO tablename, and can even SELECT it back properly. To be sure, I run SET NAMES 'utf8';` and watch it break, as expected.

We can't run SET NAMES all the time, so we can add an entry to our ~/.my.cnf file to set this for us.

[client]
default-character-set=utf8

If you check the configuration parameters, you can see the effect this has.

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.03 sec)

yukari@yakumo:/tmp% diff -ruN before after
--- before 2008-09-22 15:51:36.000000000 +1000
+++ after 2008-09-22 15:51:37.000000000 +1000
@@ -1,11 +1,11 @@
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
-| character_set_client | latin1 |
-| character_set_connection | latin1 |
+| character_set_client | utf8 |
+| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
-| character_set_results | latin1 |
+| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ | 

Checking it works

Testing this is pretty easy, and we can verify that the results are what we expect, so long as your terminal is doing unicode properly.

$ cat ~/.vimrc
set encoding=big5
set termencoding=utf8

$ cat foo.sql
<characters should appear correctly>

$ vim ~/.vimrc
<change it to encoding=utf8>

$ cat foo.sql
<should be gribble>

$ iconv -f big5 -t utf8 -o goodfoo.sql foo.sql
<there should be no output>

$ vim goodfoo.sql
<should be readable now>

This is good. We've successfully converted the sql file from BIG5 to UTF8, and vim is setup to edit it properly (if we need to edit it).

DROP the table/s in MySQL if the .sql file doesn't have conditional DROP statements. An alternative is to DROP the database and recreate it.

DROP TABLES IF EXISTS `tablename`;

Now we're ready to import the dump.

$ mysql < goodfoo.sql
<there should be no output>

Run the mysql client and SELECT the contents of a table with foreign characters, it should be readable.

mysql> SELECT * FROM `admin`;
+----+-----------+----------+---------------------+-------+
| id | name      | password | email               | links |
+----+-----------+----------+---------------------+-------+
|  3 | Cirno     | yousei   | cirno@gmail.com     | 2     | 
|  0 | 李國鼎     | chen     | chen@gmail.com      | 2     | 
+----+-----------+----------+---------------------+-------+
2 rows in set (0.00 sec)

Application

The customer is using Java, thankfully (okay, I'm thankful it's not PHP at least). Java has solid unicode support, I believe they use UTF16 for their internal representation. They've given me a trivial test page which grabs the contents of a table and displays it in the browser. I use wget to confirm that they're sending UTF8 on the wire, just to make sure.

yukari@yakumo:~% wget -S -O /dev/null http://www.example.com.au/testB.jsp
--16:31:53--  http://www.example.com.au/testB.jsp
           => `/dev/null'
Resolving www.example.com.au... 202.4.236.211
Connecting to www.example.com.au|202.4.236.211|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Mon, 22 Sep 2008 06:31:46 GMT
  Set-Cookie: JSESSIONID=2E9337B6B1741DA104519D926031FEDB; Path=/wpt
  Content-Type: text/html;charset=utf-8
  Content-Length: 685
  Vary: Accept-Encoding
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
Length: 685 [text/html]

100%[=================================================>] 685           --.--K/s             

16:31:53 (108.75 MB/s) - `/dev/null' saved [685/685]

Excellent, the Content-Type: header is exactly what I wanted to see, with the charset specified as utf8. I look at a lot of Japanese webpages, so I've already got all the necessary fonts installed. Everything displays as expected, and I can see the correct characters if I View Source of the the page. Note that it would be incorrect if I saw &#30524;&#37857;&#12387;&#23064; when I View Source; this would indicate a different problem, as mentioned earlier.

Other resources