Data fragmentation problem in MySQL & MyISAM

The other day at PSCE I worked on a customer case of what turned out to be a problem with poor data locality or a data fragmentation problem if you will. I tought that it would make a good article as it was a great demonstration of how badly it can affect MySQL performance. And while the post is mostly around MyISAM tables, the problem is not really specific to any particular storage engine, it can affect a database that runs on InnoDB in a very similar way.

The problem

MyISAM lacks support for clustering keys or even anything remotely similar. Its data file format allows new information to be written anywhere inside a table. Anywhere can be either at the end of a file where it can be simply appended or an empty space somewhere in the middle left after previously deleted row(s). This implies no particular order in which rows are stored unless there are absolutely no deletes.

Unlike MyISAM, InnoDB not only supports clustering keys, but even implicitly assumes one for every table. This way primary keys also become clustering keys. Such index determines the physical order of rows in a table. This can be leveraged to benefit frequent queries which read many rows having something in common. The examples of such queries could be asking for all data that belongs to a certain user or all messages saved in a single folder.

The benefits I mentioned appear when data size is big enough to shift the workload type from CPU-bound to disk-bound, or in other words when these queries often do not find the information they need in memory and need to read it from disk much of the time. Good data locality means that all the relevant information can be accessed with very few sequential reads rather than with many small I/Os to random places on disk. It can make a dramatic difference to database performance. A detailed example was given in this article.

The problem of data locality or fragmentation applies to MyISAM tables the same way it does to InnoDB tables. And while you can easily design a good InnoDB table that largely avoids this problem by making a good use of clustering indexes or use things like extended slow log for analysis to determine whether or not some tables are somehow affected by the problem, that does not work with MyISAM. But there is hope. It is possible to verify and manage data fragmentation with MyISAM.

Examining data

What happens when you do a full table scan on an InnoDB table? You get the result that is sorted by the primary key. There is no way to do any kind of sequential read to return rows in the actual order in which they appear inside a data file.

MyISAM is much simpler in that matter. Its data files do not have any elaborate structure that would require some special way of accessing rows. If you do a full table scan on a MyISAM table, MySQL just goes through the file reading them sequentially which is almost no different than reading a flat file:

mysql> INSERT INTO a (id) VALUES (2), (1);

mysql> SELECT * FROM a;                     
+----+------+
| id | foo  |
+----+------+
|  2 | NULL |
|  1 | NULL |
+----+------+

mysql> DELETE FROM a WHERE id = 2;

mysql> INSERT INTO a (id) VALUES (3), (4);

mysql> SELECT * FROM a;                     
+----+------+
| id | foo  |
+----+------+
|  3 | NULL |
|  1 | NULL |
|  4 | NULL |
+----+------+

This behaviour can actually be used to examine the physical organization of rows on disk. All that is needed for the analysis of data fragmentation is a complete dump of values from the column(s) that identify the sets of rows we want in as small number of fragments as possible. If for example we had a table that listed files users own, we'd be interested in examining the column that carries user identifiers, because we'd like all files that belong to a single user to be in one place on disk.

It is of course important that the dump is taken through a full table scan, so any indexes that contain the relevant column(s) should be explicitly disabled during the operation:

mysql> SELECT user_id FROM files IGNORE INDEX (ix_on__user_id_and_uploaded, 
           uk_on__user_id_and_file_id) INTO OUTFILE '/tmp/files-user_id.txt';

You can also make sure the query does not use any index by running EXPLAIN on it first. If it says type: ALL, you can use it to dump data.

The result is a relatively small file containing the list of user_id values exactly in the order as they appear in rows on disk:

# head -6 /tmp/files-user_id.txt
12404
12404
12404
12398
12404
12401

What this tells us is that three files that belong to user 12404 are first, followed by a file that belongs to user 12398, then one that belongs to 12404 again, and so on. Using this information we can easily calculate how fragmented the data is:

% sort files-user_id.txt | uniq | wc -l
1480
% uniq files-user_id.txt | wc -l       
40269366

In the example files that belong to 1480 different users are stored in 40269366 contiguous blocks (that's over 40M!). By a contiguous block I mean one or more files that belong to the same user stored next to each other (for example 12404-12404-12404, 12398, 12404 makes three such blocks). And this result seems bad result for data locality as ideally the two numbers should be reasonably close.

Of course that is only a very simple and by no means realistic calculation, because it does not account for row lengths or disk block sizes, but it's a number that gives some idea about how data looks like inside of a .MYD file. With a little bit more elaborate script that actually does something about the two factors I just mentioned, the result could be slightly different:

% ./frag.py --file=files-user_id.txt --row-length=168
1480 sets are stored in 17634553 fragments.

It is probably a little bit more accurate information, but this is less about accuracy and more about getting an idea whether there could be a problem with poor data locality or not. Going further we can start dumping detailed information about each user:

% ./frag.py --file=files-user_id.txt --row-length=168 --verbose
UserID    Count     Fragments
..
176175    60945     367
81045     1         1
80083     1170866   515153
68042     12        1
..
1480 sets are stored in 17634553 fragments.

We can see that user 176175 has 60945 rows in only 367 fragments, while 80083 has rows scattered across over half million (!) different places. Accessing each fragment likely requires one I/O, so the latter might need as many as 515153!

Addressing the problem

InnoDB with its clustering index does all the work and internally maintains the proper sort order of rows. It does not actually prevent data fragmentation from happening entirely, but it can help keeping similar rows in relatively small number fragments and in larger groups. Since MyISAM doesn't do that, managing data locality will always have to be part of database maintenance work.

One of the tools shipped with MySQL is myisamcheck. It is primarily designed to allow fixing broken MyISAM tables. However it comes with one option that can be used to improve data locality:

% myisamchk --help
..
  -R, --sort-records=#
                      Sort records according to an index.  This makes your
                      data much more localized and may speed up things
                      (It may be VERY slow to do a sort the first time!).
..

In my example I'd have to use it on an index that begins with user_id column as that would place all the rows with the same user_id value next to each other. Unfortunately myisamcheck cannot be used on a file that can be used by MySQL. That's why this method can only be used on a standby slave that can be stopped for a while or during a maintenance window.

Right after defragmentation the statistics should start looking much better:

% ./frag.py --file=files-user_id-defragmented.txt --row-length=168 --verbose
1480 sets are stored in 1480 fragments.

1480 in 1480 fragments is just perfect.

The gain

In disk-bound workloads the gain from improving data locality can be enormous. Here is a real life example of one query that was executed two times - before the data file optimization and after - in MySQL instance running on top of a 3.6GHz CPU and a 30-disk RAID50 volume. Each time it was run on empty caches to see the worst case scenario as it had to read everything from disk:

Before defragmentation: 1103702 rows in set (52 min 45.29 sec)
After defragmentation: 1103702 rows in set (0.58 sec)

That is several orders of magnitude slower when data was fragmented. How many I/O operations were needed? With approximately 200 IOPS the first query needed 53 minutes, so it gives approximately 636000 I/O operations, almost every single one was 4KB long (a single block read). The estimation was for 515153 I/O operations, which wasn't too far away from the real number. The second query needed only a couple hundred long reads.

The same experiment was also repeated on a system with four Fusion-io cards in RAID 10 configuration:

Before defragmentation: 1103702 rows in set (1 min 22.28 sec)
After defragmentation: 1103702 rows in set (4.13 sec)

Using Fusion-io didn't really help to mitigate the data fragmentation problem. While the slow query ran 50 times faster compared to a system with a storage based on regular hard drives, almost one and a half minute was still a very poor result.

Conclusions

Data fragmentation can become a very serious problem affecting MySQL performance once the working set grows beyond a certain size that can no longer be cached in memory. Keeping the fragmentation under control can guarantee that even disk-bound queries can execute quickly, but it also reduces the pressure on the storage. Unfortunately MyISAM has no built-in features that would help with managing data locality or either limit or prevent fragmentation in any way. It has to be taken care of as part of a regular database maintenance work. That is why it may usually be better to consider a migration to InnoDB if data locality can become an important factor.

Extra thought: even the most capable flash-based storage may not really solve some basic problems!

Blog