Optimizing MySQL performance with accurate keys

MySQL performance is largely defined by keys and how efficiently queries can use them. As you scale, at a certain point it is no longer enough to have just any indexes and expect good performance in return. You have to design them carefully so that your queries do as little work as possible.

The approach presented in this article can sometimes help in designing such efficient indexes. As a consultant, I rely on it myself from time to time when I have to optimize a query running against a database I know nothing about.

Let's assume there is an application that collects user activity in various places. The application uses a poorly indexed database, so there are plenty of examples to choose from. Our example query performs a full table scan, which means it reads all rows from the table it uses. It is also among the statements the application executes most often.

mysql> EXPLAIN SELECT * FROM `checkins` WHERE user_id = 1410 AND checkin_source = 3 AND checkin_type IN (3,5)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: checkins
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1039425
        Extra: Using where

A full table scan is indicated by the type field, which simply says ALL. Another relevant piece of information is the estimate of how many rows the query will have to read. MySQL thinks it may be 1039425, but with InnoDB tables this number is only an approximation and is usually a bit skewed. It should still give an idea of the magnitude of the effort even without fine accuracy. It is clear that the query needs an index to become more efficient, but what index exactly?
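To get a feel for how far such an estimate can drift, the row count kept in table statistics can be compared with an exact count. This is only a sketch, assuming the table lives in the currently selected schema. The statistics value is related to the figure EXPLAIN reports but is not guaranteed to be identical, and the exact count itself has to scan the table.

```sql
-- Approximate row count from table statistics (an estimate for InnoDB).
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'checkins';

-- Exact row count, for comparison with the estimate above.
SELECT COUNT(*) FROM `checkins`;
```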

A key is typically expected to offer high cardinality, or in other words, to contain many distinct values. This allows an index lookup to hit fewer, more relevant rows. Often, just by looking at a query, it is easy to guess which columns have good cardinality and which don't. The table from our example query holds information on user activity, so there will likely be a good mix of many different values in the user_id column. But can there be equally many sources where user check-ins originate? A reasonably good guess is that the options are limited to a web page, a few named social networks, a mobile application and perhaps a few more. In other words, not a lot.

This can always be verified:

mysql> SELECT COUNT(1) FROM `checkins` WHERE user_id = 1410;
+----------+
| COUNT(1) |
+----------+
|     6360 | 
+----------+

mysql> SELECT COUNT(1) FROM `checkins` WHERE checkin_source = 3; 
+----------+
| COUNT(1) |
+----------+
|   108623 | 
+----------+

It becomes evident that user_id is a better candidate for indexing than checkin_source, as it finds fewer rows. A query executing against a key on the former column would match only 6360 rows at first, while one executing against a key on the latter would have to start with 108623 rows. Why start? Because filtering happens in two stages. If MySQL has an index that can be applied, it uses it to find matches, but only using the columns that exist in both the WHERE clause and the index (with some limitations too). Every matching row is then read in full and any remaining conditions are applied. Starting with over one hundred thousand rows rather than about six thousand would push roughly seventeen times more of them into post-filtering in the second step. Looking at the bigger picture, it is not just seventeen times more work for the CPU, but possibly many more I/O requests as well.

A cross-check on cardinality could also be done:

mysql> SELECT COUNT(DISTINCT user_id) AS cardinality FROM `checkins`;
+-------------+
| cardinality |
+-------------+
|       26360 | 
+-------------+

mysql> SELECT COUNT(DISTINCT checkin_source) AS cardinality FROM `checkins`;
+-------------+
| cardinality |
+-------------+
|          18 | 
+-------------+

There are over twenty-six thousand unique user values in this table but only eighteen different check-in sources, so the user_id column has much higher cardinality.
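The two cardinality checks can also be folded into a single pass over the table and expressed as a ratio of distinct values to total rows. This is a sketch rather than a required step, and the column aliases are made up for illustration. A ratio closer to 1 means a lookup on that column alone narrows the result down further.

```sql
-- Distinct values divided by total rows, per candidate column.
-- Both expressions are computed in one scan of the table.
SELECT COUNT(DISTINCT user_id) / COUNT(*)        AS user_id_selectivity,
       COUNT(DISTINCT checkin_source) / COUNT(*) AS checkin_source_selectivity
FROM `checkins`;
```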

But is adding an index on user_id really the best that can be done? A new key could be created on only one of the columns referenced in the WHERE clause. Or it could be created on a combination of these columns, which is called a composite index.

Why not try examining the data even further? How many rows does the original query actually match?

mysql> SELECT COUNT(1) FROM `checkins` WHERE user_id = 1410 AND checkin_source = 3 AND checkin_type IN (3,5);
+----------+
| COUNT(1) |
+----------+
|      832 | 
+----------+

That is even better than the 6360 rows matched by the user alone. It means that using all three columns together can improve selectivity significantly, i.e. reduce the number of rows matched by a set of conditions. If this behavior is reasonably consistent, which can be verified through query statistics from the MySQL slow log, the information can be used to improve the new index design. A composite key on (user_id, checkin_source, checkin_type) now seems a good candidate, better than user_id alone, but another question emerges... could it be shorter?

Indexes shouldn't be needlessly long, because they use disk space and memory. They also require maintenance work from the database to keep them up to date, so if a column does not benefit MySQL performance to a measurable extent, it should not be part of any index.
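One rough way to keep an eye on that cost is to look at how much space the table's indexes take before and after adding a key. This is a sketch, assuming the table is in the currently selected schema. For InnoDB both figures are approximations, and INDEX_LENGTH covers all secondary indexes on the table together rather than a single one.

```sql
-- Approximate size in bytes of the table data and of its secondary indexes.
SELECT DATA_LENGTH, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'checkins';
```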

What if only two of the three columns were used?

mysql> SELECT COUNT(1) FROM `checkins` WHERE user_id = 1410 AND checkin_source = 3;
+----------+
| COUNT(1) |
+----------+
|      869 | 
+----------+

Thirty-seven more rows are not going to turn into a measurable difference in most cases, so it should be okay to use a shorter index covering just (user_id, checkin_source).
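The separate counts used so far can also be gathered in one statement, which makes the candidates easier to compare side by side. This is a sketch that relies on MySQL evaluating a comparison to 1 or 0, so SUM() counts the rows where the condition holds. The aliases are illustrative only, and the query still performs one full table scan.

```sql
-- Rows matched by each candidate set of conditions, in a single scan.
SELECT SUM(user_id = 1410)                          AS by_user,
       SUM(checkin_source = 3)                      AS by_source,
       SUM(user_id = 1410 AND checkin_source = 3)   AS by_user_and_source,
       SUM(user_id = 1410 AND checkin_source = 3
           AND checkin_type IN (3,5))               AS by_all_three
FROM `checkins`;
```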

Conclusion

We considered four different keys for optimizing performance of the example query:

(user_id)
(checkin_source)
(user_id, checkin_source, checkin_type)
(user_id, checkin_source)

The last one offered the best ratio of efficiency to cost, being roughly seven times more selective than index number one (869 rows matched instead of 6360) and slightly less expensive than the comparable index number three. We therefore found the accurate index, the one that should help MySQL performance the most for this particular query.
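Creating the chosen key and checking the result could look like the sketch below. The index name idx_user_source is made up for illustration, and the columns are the ones used in the example query. On a table of this size, building the index may take a while and can affect production traffic, so it is worth trying on a copy first.

```sql
-- Add the two-column composite key selected above (hypothetical index name).
ALTER TABLE `checkins`
  ADD INDEX `idx_user_source` (`user_id`, `checkin_source`);

-- Re-run EXPLAIN on the original query to see whether the new key is used.
EXPLAIN SELECT * FROM `checkins`
WHERE user_id = 1410 AND checkin_source = 3 AND checkin_type IN (3,5);
```

If the optimizer picks the new key, the type field should no longer say ALL, the key field should show the index name, and the rows estimate should drop to somewhere near the counts measured earlier.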
