MySQL Character encoding

Breaking and unbreaking your data

Recently at FOSDEM, Maciej presented "Breaking and unbreaking your data", a talk about the problems you can run into with character encoding when working with MySQL. In short, there are many places where character encoding can be controlled, which gives ample opportunity for the system to break and for text to become unrecoverable.

The slides from the presentation are available on SlideShare.

[embed]http://slideshare.net/mushupl/character-encoding[/embed]

Since slides don't tell the whole story, we decided to create a series of blog posts to demonstrate how easy it is to go wrong, how to fix some of the issues and how to avoid such issues in the future.

What is character encoding?
A character encoding defines how characters are represented in binary, where each character is represented by one or more bytes. Popular schemes include ASCII and the Unicode encodings such as UTF-8, as well as language-specific character sets: Latin US, Latin1 and Latin2, which are commonly used in America and Europe, and EUC-KR or GB18030, which support East Asian languages such as Korean and Chinese. The same character can be represented by different byte sequences in different encodings, and the same byte sequence can correspond to different characters, depending on the encoding scheme used.
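To make that last point concrete, here is a small Python sketch (Python is used purely for illustration and is not part of the original talk). It encodes one character under several schemes and decodes one byte under two of them. The codec names are Python's own labels, not MySQL character set names.

```python
# One character, several byte sequences, depending on the encoding.
char = "ó"
as_latin1 = char.encode("latin-1")    # ISO-8859-1
as_latin2 = char.encode("iso8859_2")  # ISO-8859-2
as_utf8 = char.encode("utf-8")

print(as_latin1, as_latin2, as_utf8)

# One byte, several characters, depending on the encoding.
byte = b"\xb1"
print(byte.decode("latin-1"), byte.decode("iso8859_2"))
```

The single-byte sets happen to agree on "ó" but disagree on the byte 0xB1, while UTF-8 needs two bytes for the same character.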

Where do you set character sets in MySQL?
Here is the core of the problem: character encoding can be controlled from the application, the database, or even on a per-table or per-column basis. Combined with a set of inheritance rules, this makes it easy to have one layer of the system configured for one character set while the data actually being written uses a different one.

In MySQL, the following settings can all affect the character encoding used.

Session settings
- character_set_server
- character_set_client
- character_set_connection
- character_set_database
- character_set_results
Schema level defaults
Table level defaults
Column charsets
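The inheritance between these levels can be pictured with a short Python sketch. It is a simplified model, assuming the most specific explicit setting wins and anything left unset falls back to the level above it. The function name and its arguments are hypothetical, collations are ignored, and in MySQL itself the default is applied when the object is created rather than looked up on every query.

```python
from typing import Optional


def effective_charset(server: str,
                      schema: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    """Return the charset a column ends up with in this simplified model."""
    # Walk from the most specific level to the least specific one.
    for setting in (column, table, schema):
        if setting is not None:
            return setting
    # Nothing was set explicitly, so the server default applies.
    return server


# Nothing set explicitly: the column silently inherits the server default.
print(effective_charset(server="latin1"))
# A schema default overrides the server default.
print(effective_charset(server="latin1", schema="utf8"))
# An explicit column charset overrides everything above it.
print(effective_charset(server="utf8", table="utf8", column="latin1"))
```

The third call is the awkward case: every visible default says utf8, yet one column still stores latin1.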

Character encoding in MySQL
As Maciej pointed out in the presentation, where MySQL is concerned we are all born Swedish: MySQL starts out configured for the latin1 character set, with the collation set to latin1_swedish_ci. This is even the case in MySQL 5.7, meaning that by default your system expects only characters in the latin1 set and, when comparing characters, assumes Swedish language rules.
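As a rough illustration of what "only characters in the latin1 set" means, the Python sketch below checks which of three city names fit into ISO-8859-1. Note the assumption: Python's latin-1 codec is strict ISO-8859-1, whereas MySQL's latin1 is closer to Windows cp1252, so this approximates the server rather than modelling it exactly. The helper name is made up for this example.

```python
def fits_in_latin1(text: str) -> bool:
    """True if every character has a single-byte ISO-8859-1 representation."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for city in ("Berlin", "Kraków", "東京都"):
    print(city, fits_in_latin1(city))
```

The Polish "ó" is part of latin1, but the Japanese characters are not, so they cannot be stored faithfully as latin1 text.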

Let's look at how this manifests itself in a new application, where server, client and table are all set to the default latin1.


mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | latin1                         |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)

mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
       Table: locations
Create Table: CREATE TABLE `locations` (
  `city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

So what happens when you try to save some data that is not latin1 encoded?

The city of Tokyo is displayed.

The application returned and rendered the new city correctly. Inside the database, however, there is some confusion.

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from locations;
+--------------------+
| city               |
+--------------------+
| Berlin             |
| KrakÃ³w            |
| æ±äº¬éƒ½           |
+--------------------+
3 rows in set (0.00 sec)

The data being saved was UTF-8 encoded. However, if an application queries the database over a utf8 connection, it receives garbage. Instead, the application must ask for latin1 to receive the original data.

mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from locations;
+-----------+
| city      |
+-----------+
| Berlin    |
| Kraków    |
| 東京都     |
+-----------+
3 rows in set (0.00 sec)

The new city was saved and, from the application, the result looked correct. What is actually happening is that the connection to the database saved the binary data without any conversion. Hence it returned the same bytes, and the browser was able to do the right thing and display them correctly, as did the terminal, which was set to UTF-8. The database itself, though, cannot interpret the data in the correct context, because it believes the bytes are latin1.
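What happened to the bytes can be reproduced outside MySQL. The Python sketch below is only an approximation: it uses Python's latin-1 codec to stand in for MySQL's latin1 (which is closer to cp1252), and the variable names are illustrative.

```python
city = "東京都"

# The application sends UTF-8 bytes, but the connection is declared latin1,
# so the server stores each byte as if it were one latin1 character.
stored = city.encode("utf-8").decode("latin-1")

# Reading over a latin1 connection hands the same bytes back untouched;
# a UTF-8 terminal or browser then renders them correctly.
read_as_latin1 = stored.encode("latin-1").decode("utf-8")

# Reading over a utf8 connection makes the server convert its "latin1"
# characters to UTF-8, so the client receives double-encoded text.
sent_over_utf8 = stored.encode("utf-8")

print(read_as_latin1 == city)        # the latin1 round trip is lossless
print(len(city), len(stored))        # the database counts 9 characters, not 3
print(len(sent_over_utf8))           # and ships 18 bytes over a utf8 connection
print("Kraków".encode("utf-8").decode("latin-1"))  # the same effect on "ó"
```

The character count matters in practice: from the database's point of view the three-character city name occupies 9 of the 30 characters allowed by VARCHAR(30), and any length check, comparison or sort operates on those 9 mis-labelled characters.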

In the next blog post we will look at how to correctly configure character sets, as well as demonstrating some of the problems we have encountered in production systems and how we fixed those.
