Granting privileges may break replication in MySQL 5.6.10

MySQL lets database administrators define access rights on many levels – from the ability to run global commands down to access to individual columns. Some rights can be applied to many different objects, such as for example SELECT or UPDATE, which can be granted globally or restricted only to certain databases or tables, while others are only meant for one specific purpose. An example of the latter could be FILE privilege, which permits user to interact with the file system from inside a database instance. It only makes sense as the global right and not anywhere else.

As any other activity that produces changes, GRANT statements are replicated to MySQL slaves. Regardless of the binary log format setting, such events are always logged in STATEMENT format. It is likely because the command needs to handle more than just updating the contents of a few system tables, so such design allows each slave to fully execute the changes within their own environment.

This design, the fact that system tables are still non-transactional – they use MyISAM, as well as what I believe was an attempt to address another problem, causes an opportunity for a situation, when issuing an incorrect GRANT statement may break replication.

The following command attempts to grant FILE privilege on test database, which of course is not a valid operation, so MySQL refuses it and even gives an explanation of the problem:

master [localhost] {root} ((none)) > GRANT FILE ON test.* TO test@localhost;
ERROR 1221 (HY000): Incorrect usage of DB GRANT and GLOBAL PRIVILEGES

This is where it ended in previous MySQL versions. MySQL 5.6, however, makes one more step. It writes a binary log event:

#130409 16:49:02 server id 1  end_log_pos 432 CRC32 0xc2005658
# Incident: LOST_EVENTS
RELOAD DATABASE; # Shall generate syntax error
# at 432

This is called an incident event. LOST_EVENTS states that at that particular point some modifications may have happened, but they failed to reach the binary log. This triggers all slaves to bail out with the following error:

slave1 [localhost] {msandbox} ((none)) > SHOW SLAVE STATUSG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 18675
[..]
				  Last_Errno: 1590
                   Last_Error: The incident LOST_EVENTS occured on the master. Message: error writing to the binary log
                 Skip_Counter: 0
[..]

Up until recently, this type of event was only used by replication of MySQL Cluster. MySQL 5.6 starts making use of it in many more places and one of them is sql_acl.cc:

  /*
    We only log "complete" successful commands, because partially
    failed REVOKE/GRANTS that fail because of insufficient privileges
    on the master, will succeed on the slave due to SQL thread SUPER
    privilege. Even though replication will stop (the error code from
    the master will mismatch the error code on the slave), the
    operation will already be executed (thence revoking or granting
    additional privileges on the slave).
    When some error happens, even partial, a incident event is logged
    instead stating that manual reconciliation is needed.
  */
  if (result)
    mysql_bin_log.write_incident(thd, true /* need_lock_log=true */);
  else
    result= result |
            write_bin_log(thd, FALSE, thd->query(), thd->query_length(),
                          transactional_tables);

This of course seems a reasonable way of preventing slaves from going out of sync – stop replication if unsure that everything will be okay.

In this case, however, the problem should be stopped sooner – by high level checks whether the command issed by a user makes any sense or not and not when the execution reaches a place where it triggers a much more serious problem (e.g. takes down a site by stopping MySQL slaves).

When this happens, the solution is typically to skip the incident event by using SQL_SLAVE_SKIP_COUNTER (when Global Transaction IDs are disabled) or injecting an empty transaction (when GTIDs are enabled). However unless you know for a fact that someone was messing around with privileges, there is no easy way of learning about the actual cause for which the incident event was logged. It also means you should check master and slave for differences after slaves catch up, even if only it means comparing the contents of system tables.

Bug report has been filed at http://bugs.mysql.com/68892.

Blog