(My)SQL mistakes. Do you use GROUP BY correctly?

Often I see a SQL problem solved incorrectly and I do not mean inefficiently. Simply incorrectly. In many cases the developer remains unaware that they aren't getting the results they were expecting or even if a result is correct, it is only by chance, for example because the database engine was smart enough to figure out some non-sense in a query. In a few posts I will try to disclose some of the more common problems.

Aggregate with GROUP BY

Unlike many other database systems, MySQL actually permits that an aggregate query returns columns not used in the aggregation (i.e. not listed in GROUP BY clause). It could be considered as flexibility, but in practice this can easily lead to mistakes if a person that designs queries does not understand how they will be executed. For example, what values an aggregate query returns for a column that wasn't part of the grouping key?

mysql> SELECT user_id, id, COUNT(1) FROM bets WHERE user_id = 99 GROUP BY user_id;

+---------+-------+----------+

| user_id | id    | COUNT(1) |

+---------+-------+----------+

|      99 | 12857 |       12 |

+---------+-------+----------+

id column is a surrogate key and carries a unique value in each and every row, so does 12857 in the query result make any sense, then? Why 12857 and not any other value when user 99 also has eleven other rows in the table? Unless id only had a single value throughout all user's rows or I actually wanted to see a single randomly chosen value there, the result is probably not what I needed. GROUP BY does not care about columns that are not part of the aggregate key, so if your query requests them, you should not rely too much on the values they return.

At the same time MySQL comes with a number of aggregate functions. They can process data from the individual rows while GROUP BY is executing. For example I can aggregate by user_id, but also remember and then list all the values from id column:

mysql> SELECT   user_id,

         GROUP_CONCAT(id) _id,

         COUNT(1)

FROM     bets

WHERE    user_id = 99

GROUP BY user_id;

+---------+-------------------------------------------------------------+----------+

| user_id | _id                                                         | COUNT(1) |

+---------+-------------------------------------------------------------+----------+

|      99 | 2857,2856,2858,2851,2852,2855,2853,2854,3201,3200,3262,3261 |       12 |

+---------+-------------------------------------------------------------+----------+

Such result probably makes more sense to my application, because it received the complete information rather than only a random piece.

Sorting an aggregation

What if I needed to grab a few users that most recently made an action based on an activity table? The first thing that comes to my mind:

mysql> SELECT user_id, bet_date FROM bets GROUP BY user_id ORDER BY bet_date DESC LIMIT 3;

+------------+---------------------+

| user_id    | bet_date            |

+------------+---------------------+

|         99 | 2009-12-08 22:51:38 |

|         93 | 2009-11-03 12:39:07 |

|         95 | 2009-09-29 09:23:07 |

+------------+---------------------+

It was easy, wasn't it? However, is the result correct?

ORDER BY is applied after GROUP BY, so it operates on a set that has been already aggregated and not on the individual rows. And how did the aggregation work? For each user GROUP BY collapsed a number of values from bet_date into a single date. This way user 99 received "2009-12-08 22:51:38". But why did it get that particular value? Was it the most recent date among the user's records as per ORDER BY caluse? Let's examine the raw data:

mysql> SELECT user_id, bet_date FROM bets WHERE user_id = 99 LIMIT 5;    

+------------+---------------------+

| user_id    | bet_date            |

+------------+---------------------+

|         99 | 2009-12-08 22:53:20 |

|         99 | 2009-12-08 22:53:09 |

|         99 | 2009-12-08 22:53:37 |

|         99 | 2009-12-08 22:51:38 |

|         99 | 2009-12-08 22:51:58 |

+------------+---------------------+

Just by looking at these rows it becomes clear that "2009-12-08 22:51:38" could be anything except the most recent entry, so the answer is 'no'. So why was it picked for the result? Because MySQL did not evaluate the sort order at the time it was building the aggregation. In fact "2009-12-08 22:51:38" came from the very first row that MySQL saw for user 99. In different circumstances the same query on the same data set could return a different value there.

This means the original query did not return the result I expected, because ORDER BY was applied on a limited set of somewhat random values picked by GROUP BY and never saw most values from bet_date.

How to make it work?

I can rely on the aggregate functions again, which gives me some access to the raw values:

mysql> SELECT   user_id,

         MAX(bet_date)

FROM     bets

GROUP BY user_id

ORDER BY MAX(bet_date) DESC

LIMIT    3;  

+------------+---------------------+

| user_id    | MAX(bet_date)       |

+------------+---------------------+

|         91 | 2010-05-22 18:49:41 |

|         92 | 2010-05-22 17:18:50 |

|         88 | 2010-05-22 15:14:42 |

+------------+---------------------+

This looks better.

Now, is there a way to learn some other column's value for these most recent entries? The following query cannot work correctly. Why? For the reasons that were covered in the first part of this post. game_id will carry any value from the aggregated set, so a value that may or may not be related to the row that has the most recent bet_date:

mysql> SELECT   user_id,

         game_id,

         MAX(bet_date)

FROM     bets

GROUP BY user_id

ORDER BY MAX(bet_date) DESC

LIMIT    3;  

+---------+---------+---------------------+

| user_id | game_id | MAX(bet_date)       |

+---------+---------+---------------------+

|      91 |     832 | 2010-05-22 18:49:41 |

|      92 |     831 | 2010-05-22 17:18:50 |

|      88 |     898 | 2010-05-22 15:14:42 |

+---------+---------+---------------------+

A query to solve such problem may need to be a lot more complex and will rarely be very efficient. That is why often it is better to create a summary table for easy querying, instead of running an aggregate query every time such information is required. Here is the example of a query that returns correct result:

mysql> SELECT   _d.user_id,

         bets.game_id,

         _d.bet_date

FROM     (SELECT  user_id,

                  MAX(bet_date) bet_date

         FROM     bets

         GROUP BY user_id

         ORDER BY MAX(bet_date) DESC

         LIMIT    3

         )

         _d

         JOIN bets

USING    (user_id, bet_date)

ORDER BY _d.bet_date DESC;

+---------+---------+---------------------+

| user_id | game_id | bet_date            |

+---------+---------+---------------------+

|      91 |    1004 | 2010-05-22 18:49:41 |

|      92 |    1004 | 2010-05-22 17:18:50 |

|      88 |    1004 | 2010-05-22 15:14:42 |

+---------+---------+---------------------+

Summary

Using GROUP BY can become tricky beyond doing simple aggregations. Due to the relaxed restrictions in MySQL on how the clause can be used in a statement, it is easy to create queries, which do not work correctly. At the same time database does not issue any warnings of a possible problem, so it is entirely up to you to verify whether results are correct and meet your expectations.

Blog