Analyzing I/O performance

There are probably thousands of articles on the Internet about disk statistics in Linux, what various columns mean, how accurate the information is, and so on. I decided to attack the problem from a little bit more practical side. Hopefully this will be just the first of many future posts on identifying various I/O related performance problems on a MySQL server.

Linux exposes disk statistics through /proc/diskstats. However the contents of this file isn't something anyone can understand quickly. It needs a tool to transform the information into something human readable. A tool that is available for any Linux distribution is called iostat and comes with sysstat package.

How to access and read I/O statistics

Usually you want to call iostat one way:

iostat -xkd <interval> <block device>

The interval should typically be one second as it is the base unit for many things, but also because it gives the best level of detail. With longer intervals some of it could be averaged out. This is a typical example of where any analysis begins:

bash-4.2 # iostat -xkd 1 /dev/drbd0

[..]

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.35    0.12    9.61    3.12    0.00   64.79

Device: rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
drbd0     0.00     0.00   16.00   79.00   192.00  1408.00    16.84     2.66   27.96   3.75  35.60

Note: Make sure to ignore the first round of output iostat produces as it shows the statistics concerning the time since the system was booted rather than only a current period.

What does it say? In interval of the specified length, /dev/drbd0 device was doing 16 reads per second (r/s) and 79 writes per second (w/s). The average I/O request size was roughly 8.5 kilobytes (avgrq-sz * 512 bytes), while the average wait time 27.96 milliseconds (await).

How to use the information?

Out of all these metrics, one is definitely more important than others. The average wait time says how much time an average I/O request spent both waiting in queue and executing, which you should read as how long my database may had to wait on a disk access. It's the response time. This is only an average, so by definition not a very accurate information, however in most cases it tells enough to recognize a potential problem. For example, nearly thirty milliseconds could be a lot if one expects queries to return within only a few or less. The queries that find information buffered in memory may still meet the expectations, however the disk-bound ones could very well become many times slower.

It's important to relate the response time to the capabilities of the storage on which it is measured. You should not expect the same range of values from an EBS volume as from a directly-attached RAID array.

Newer versions of iostat provide even more useful information by splitting await into two separate columns: one for reads (r_await) and one for writes (w_await). This way you can see how each type of activity contributes to the global average, but more importantly, it helps to understand more precisely where the problems may be. Unfortunately this version is not available on many Linux distributions, at least not by default.

But there is another way to get the extra insight. A script called pt-diskstats from Percona Toolkit reads information from /proc/diskstats directly and generates an output that includes these more detailed statistics.

bash-4.2 # pt-diskstats -d 'drbd0$' -c 'rd_s|rd_rt|rd_cnc|wr_s|wr_rt|wr_cnc'

#ts device   rd_s rd_cnc   rd_rt    wr_s wr_cnc   wr_rt
{1} drbd0     7.9    0.0     1.0    62.2    0.9    15.1
{1} drbd0     4.6    0.0     1.1    57.9    1.4    24.2
[..]

Note: The example uses pt-diskstats 1.0. The current version is 2.1 and implements a different naming scheme for the command line parameters, however I discovered the newer versions sometimes used to produce incorrect results and therefore I postponed my upgrade until later time.

The output is a bit different from iostat, but it shows the same information. rd_rt and wr_rt are response times of the respective I/O request types and they carry the same information as await.

The next step would be looking at how many I/O operations happen every second. If the sum of r/s and w/s is close to the storage theoretical capacity, it can very well explain the elevated await times. In order words, this could mean the system is running out of the I/O capacity. Being able to identify such situation is important as it can save you from investigating a performance problem when its direct cause is just in front of your eyes. This is why you should always benchmark a storage before it is put to use, although please remember that you need to benchmark for IOPS rather than for throughput.

Afterwards it may also be worth checking at how much data is being pushed in or out. A very long sequential read or write may need significantly more time compared to a much shorter one, so a larger avgrq-sz could explain the extended waits. Until you learn what values make sense and which look odd, this is where a good graphing system comes in handy as it would enable you to compare the historical sizes to current ones. In practice, however, this metric is mostly useful in discovering problems that are not coming from database itself as the only sequential I/O it can be doing are read-aheads and they aren't very common. A running backup, some runaway rsync - these are more typical examples of things that could noticeably affect it. Related values are shown in rkB/s (or rsec/s) and wkB/s (or wsec/s), although they show the accumulated size of reads or writes respectively rather than just per-request average.

One last item is %util. It shows what percentage of the interval the block device was busy serving requests. People often tend jump to conclusions about a storage performance only by looking at this number, because it is easy. I almost never look at it. The value here can sometimes be very misleading as Linux does not recognize that a block device it sees, may be a logical RAID volume that down below consists of multiple physical disks. Therefore any parallel I/O execution will never be reflected properly in this metric, so while it can be used as a hint in certain cases (i.e. when it is close to 100%), you should always verify the actual utilization through other metrics. The simplest way is calculating the real value through the following formula:

concurrency = (r/s + w/s) * (svctm / 1000) %util = concurrency * 100%

There are two important pieces of information in here. One is that request concurrency can be calculated. Knowing the value may become important when figuring out problems with I/O requests serialization which occurs in applications (e.g. MySQL working as replication slave where queries execute in a single thread). Or it may show that the concurrency is higher than the number of disks in an array, which would imply the storage cannot keep servicing incoming requests. And the second is that utilization can in fact be higher than 100% with a storage built from multiple disks, which is what iostat would never show.

Working with the information

In the final section, let's go through an example.

Good graphs can be useful. I like to graph the iostat or pt-diskstats output over longer periods, a few minutes at least, because it allows seeing patterns and spotting problems without even bothering yourself with looking at numbers until there is actually a reason.

What we can see is that write response times are quite high and periodically spike to 200 milliseconds or even higher, while at the same time reads seem to be doing well. There appears to be no unusual activity around the spikes, though. The I/O utilization remains in a reasonable range of 100 - 150 IOPS, which is something even a single disk should handle without delays.

With directly-attached, ordinary spinning disks the response times should hardly ever cross the ten millisecond threshold. That limit is sufficient for most disks that are commonly installed in database servers. Assuming a RAID array with some write-back caching, write response time shouldn't even be anywhere near that, because everything should be going straight to a much faster storage - the cache memory. But for some reason we see 15ms or more.

The Throughput and Requests Size graph doesn't add anything interesting. Nothing there can be correlated with the increased response times.

However when we zoom in on one of the spikes in the I/O Utilization and Performance graph, we can make one interesting observation. Reads not only work without any issues, but seem completely unaffected by the problem (blue and red lines on the graph). Why?

We should focus on the block device for a moment. The statistics were pulled from a DRBD volume, which holds the MySQL data files. DRBD is a block device driver for redundant storage, working on top of an another block device. The driver intercepts writes and synchronously sends modified blocks to another system, where another "instance" of DRBD mirrors the change onto its own underlying storage. Synchronously means that it blocks an I/O request until the write completes in both locations. From there we can quickly conclude that reads are fine, because they always come from the directly-attached disk. Writes are slower, because they need to be synchronized over the network to the secondary storage. So perhaps it is the other server that isn't performing well, or maybe there are problems on the network layer, or maybe DRBD has been poorly configured... Of course, at this point all we have are guesses, but together with the data collected, these guesses set a pretty good direction for further investigation.

Blog