Noop I/O scheduling with SSD storage

September 11, 2012 Technical, General

Solid State Drives (SSDs) are not new in the server world, but they’ve seen somewhat limited uptake due to their high cost per gigabyte of storage. This has been changing as prices continue to drop, and we’re now at the point where SSDs represent a viable option for primary storage of high-value data that is read and written heavily.

While SSDs are a drop-in replacement for traditional hard drives (HDDs) when it comes to servers, their behaviour and performance characteristics are fundamentally different. This means that the disk-access scheduling algorithms that we’ve developed over time, originally designed for rotational media, don’t necessarily suit SSDs.

We recently observed some interesting behaviour as a result of changing disk access schedulers, which we thought we’d share with you here.

The difference between HDDs and SSDs

Rotational drives are pretty quick at reading and writing data sequentially, which is a function of the speed of rotation. Moving the drive’s read/write head to a different part of the disk, called “seeking”, typically takes a few milliseconds. SSDs have no moving parts, so they take almost no time to seek and tend to have constant read/write speeds.

HDDs are:

  • Fast to transfer data
  • Not very fast at seeking

SSDs are:

  • Very fast at transferring data
  • Extremely fast at seeking

Our scheduling algorithms attempt to reorder disk access patterns to mitigate the shortcomings of HDDs. This necessarily imposes some latency on I/O requests, but performance generally benefits as a result. The schedulers are based on the assumptions of reasonable transfer speeds and slow seek times.
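To make the reordering idea concrete, here is a minimal sketch of an elevator-style sweep compared with servicing requests in arrival order. It is a toy model, not how CFQ or any real Linux scheduler is implemented: the block addresses, the starting head position, and the function names are all made up for illustration, and "seek cost" is simply the distance the head travels.

```python
def total_seek_distance(start, requests):
    """Sum of head movements when servicing requests in the given order."""
    position, distance = start, 0
    for block in requests:
        distance += abs(block - position)
        position = block
    return distance


def elevator_order(start, requests):
    """One upward sweep from the head position, then one downward sweep."""
    upward = sorted(b for b in requests if b >= start)
    downward = sorted((b for b in requests if b < start), reverse=True)
    return upward + downward


# Hypothetical block addresses in arrival order; the head starts at block 50.
arrivals = [95, 10, 60, 5, 80, 20]
fifo = total_seek_distance(50, arrivals)
swept = total_seek_distance(50, elevator_order(50, arrivals))
print(f"arrival order: {fifo} blocks travelled, elevator order: {swept}")
```

With these made-up numbers the head travels 370 blocks in arrival order and 135 after reordering. On an HDD that saving is worth the wait; on an SSD, where "distance" costs almost nothing, only the wait remains.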

Effects of the scheduler with SSDs

The default I/O scheduler for Linux is CFQ (Completely Fair Queueing), which is a good all-round scheduler. Common wisdom holds that CFQ is unnecessary or sub-optimal on systems that break the usual assumptions, such as hardware RAID cards equipped with a write-cache, or SSDs. In these cases, the “noop” scheduler is suggested. The noop scheduler does next to nothing (“NO OPeration”): it is essentially a first-in, first-out queue that forgoes the opportunity to reorder our disk accesses.
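If you want to check which scheduler a device is currently using, the kernel exposes it in sysfs with the active one in square brackets. The sketch below parses that format; the function names and the `sysfs_root` parameter are our own for illustration, and the set of schedulers listed will vary with the kernel.

```python
import re


def parse_scheduler_line(line):
    """Return (active, available) from text like 'noop anticipatory deadline [cfq]'."""
    available = line.replace("[", "").replace("]", "").split()
    match = re.search(r"\[([^\]]+)\]", line)
    active = match.group(1) if match else None
    return active, available


def current_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse /sys/block/<device>/queue/scheduler."""
    with open(f"{sysfs_root}/{device}/queue/scheduler") as handle:
        return parse_scheduler_line(handle.read())
```

For example, `current_scheduler("sda")` on a stock system of this vintage would be expected to report `cfq` as the active scheduler.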

All our VPS hypervisors have hardware RAID cards with a write-cache, so we set out to see if there were any gains to be had from switching to the noop scheduler. Seeing as we’d need it to roll out any large-scale changes, we used Puppet to help us find some suitable test subjects. By loading a custom Facter fact into our manifests, we could grab all the PCI IDs for RAID cards and see what models were installed in different servers. All the “smart” RAID cards had the same PCI ID, which corresponds to the PERC H700 and H800 hardware. Perfect!
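Our actual fact is a Facter (Ruby) plugin, which we haven’t reproduced here. As a rough Python equivalent of the idea, the sketch below pulls the vendor:device IDs of RAID controllers out of `lspci -n` output by filtering on PCI class 0104 (RAID bus controller). The function names are hypothetical, and the IDs in the sample are placeholders rather than real hardware IDs.

```python
import subprocess


def raid_controller_ids(lspci_n_output):
    """Return sorted vendor:device IDs for devices of PCI class 0104 (RAID)."""
    ids = set()
    for line in lspci_n_output.splitlines():
        fields = line.split()  # e.g. ['03:00.0', '0104:', 'aaaa:bbbb', '(rev', '05)']
        if len(fields) >= 3 and fields[1].rstrip(":") == "0104":
            ids.add(fields[2])
    return sorted(ids)


def local_raid_controller_ids():
    """Run lspci on this machine (assumes pciutils is installed)."""
    result = subprocess.run(["lspci", "-n"], capture_output=True, text=True, check=True)
    return raid_controller_ids(result.stdout)


# Placeholder sample output; these IDs are illustrative, not real devices.
sample = """\
00:1f.2 0106: cccc:dddd (rev 05)
03:00.0 0104: aaaa:bbbb (rev 05)
"""
print(raid_controller_ids(sample))
```

Grouping servers by the resulting ID list is then enough to pick out the machines that share a controller model.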

Somewhat to our surprise, switching to the noop scheduler provided no significant benefit whatsoever on our VPS hypervisors. This is okay though: it means that we could roll out the change without fear of adverse effects.

A few other servers have H700/H800 hardware RAID cards, mostly beefy database machines belonging to customers, so we did some testing on those too. It’s not terribly scientific stuff: we change the scheduler for a device (e.g. echo noop > /sys/block/sda/queue/scheduler) and then inspect the output of iostat and sar.
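The same switch can be scripted. Below is a minimal sketch of the echo command as a Python function, with a sanity check that the kernel actually offers the requested scheduler. The function name and the `sysfs_root` parameter (there so it can be pointed at a scratch directory) are our own inventions; on a real system it needs root, and the change does not survive a reboot.

```python
from pathlib import Path


def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Select an I/O scheduler for a block device, as 'echo name > .../scheduler' does."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    # The file lists every offered scheduler, with the active one in [brackets].
    offered = path.read_text().replace("[", "").replace("]", "").split()
    if name not in offered:
        raise ValueError(f"{name!r} is not offered for {device}: {offered}")
    path.write_text(name + "\n")
```

Calling `set_scheduler("sda", "noop")` and later `set_scheduler("sda", "cfq")` mirrors the switch-and-revert test described in this post.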

On one particularly busy server running MySQL, with SSDs on an H700 card, the effect was drastic. This is the output from sar:

Linux 2.6.32-131.12.1.el6.x86_64 (         06/09/12        _x86_64_        (16 CPU)

                  DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:05:01      dev8-16   1970.41  61730.99   9336.31     36.07      3.64      1.85      0.47     92.45
11:06:01      dev8-16   2022.08  53274.16   9267.08     30.93      3.08      1.52      0.44     89.49
11:07:01      dev8-16   1569.45  90892.73   7777.52     62.87      5.91      3.77      0.61     95.58
11:08:01      dev8-16   1622.45  75739.28   7849.50     51.52      4.98      3.07      0.59     95.26
11:09:01      dev8-16   2181.67  46467.75  12094.27     26.84      2.45      1.12      0.37     80.93
11:10:01      dev8-16   4507.35  80452.87  29354.22     24.36      1.16      0.26      0.12     52.59
11:11:01      dev8-16   4739.57  68095.01  30491.92     20.80      1.09      0.23      0.11     52.48
11:12:01      dev8-16   4331.84  85908.80  29801.13     26.71      1.19      0.28      0.13     54.50

The change was made around 11:09. Because sar stamps each row with the end of its one-minute interval, the effect first shows up in the 11:10:01 row. What we can clearly see is:

  • An increase in the number of I/O requests serviced per second (tps), from roughly 1,600-2,200 to 4,300-4,700
  • Roughly no change in the number of (512-byte) sectors read per second (rd_sec/s)
  • A big jump (around 3x) in sectors written per second (wr_sec/s)
  • A substantial drop in the average queue length of requests (avgqu-sz)
  • A big drop in the average time for an I/O request to be queued and serviced (await), and to be serviced (svctm)
  • A big drop, from around 90% to just over 50%, in CPU time during which I/O requests were being issued to the device (%util)
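To put rough numbers on those observations, here is a small sketch that averages each column before and after the switch. The figures are copied from the sar output above; treating the 11:10:01 row as the first "after" sample is our reading of the data, and the helper names are made up.

```python
SAR_ROWS = """\
11:05:01 dev8-16 1970.41 61730.99 9336.31 36.07 3.64 1.85 0.47 92.45
11:06:01 dev8-16 2022.08 53274.16 9267.08 30.93 3.08 1.52 0.44 89.49
11:07:01 dev8-16 1569.45 90892.73 7777.52 62.87 5.91 3.77 0.61 95.58
11:08:01 dev8-16 1622.45 75739.28 7849.50 51.52 4.98 3.07 0.59 95.26
11:09:01 dev8-16 2181.67 46467.75 12094.27 26.84 2.45 1.12 0.37 80.93
11:10:01 dev8-16 4507.35 80452.87 29354.22 24.36 1.16 0.26 0.12 52.59
11:11:01 dev8-16 4739.57 68095.01 30491.92 20.80 1.09 0.23 0.11 52.48
11:12:01 dev8-16 4331.84 85908.80 29801.13 26.71 1.19 0.28 0.13 54.50
"""
COLUMNS = ["tps", "rd_sec/s", "wr_sec/s", "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]


def parse(text):
    """Turn each sar row into a dict keyed by column name, plus its timestamp."""
    rows = []
    for line in text.splitlines():
        stamp, _device, *values = line.split()
        rows.append({"time": stamp, **dict(zip(COLUMNS, map(float, values)))})
    return rows


def column_means(rows):
    """Average every metric column across the given rows."""
    return {name: sum(row[name] for row in rows) / len(rows) for name in COLUMNS}


rows = parse(SAR_ROWS)
# HH:MM:SS strings compare correctly as text, so no time parsing is needed.
before = column_means([row for row in rows if row["time"] < "11:10:01"])
after = column_means([row for row in rows if row["time"] >= "11:10:01"])
for name in COLUMNS:
    ratio = after[name] / before[name]
    print(f"{name:>9}: {before[name]:10.2f} -> {after[name]:10.2f} ({ratio:.2f}x)")
```

On these eight samples the averages work out to about 2.4x for tps and 3.2x for wr_sec/s, with %util falling from about 91% to about 53%. Eight one-minute samples is a very small data set, so treat the ratios as indicative only.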

This is a fantastic difference. Keep in mind that this is a single data point – unfortunately we don’t have enough customers with regularly busy systems to use for testing – but the benefit in this particular instance was immediately apparent. The ideal next step is to liaise with the customer to find out what sort of database activity is benefiting from the change, and perform further analysis.

In this case it’s evident that the system was pining to perform more writes to disk, but the device was probably near saturation, as evidenced by the %util column. We suspect that the scheduler was a bottleneck in the disk access path: that SSDs are so fast compared to HDDs that the net effect of the scheduler is now a penalty instead of a benefit.
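The sar columns also hang together in a way that supports this reading. As far as we understand how sysstat derives them, the queue length is approximately tps multiplied by await (Little’s law), and %util is approximately tps multiplied by svctm. The sketch below checks both on one row from each side of the change; the function names are ours, and the values are copied from the sar output above.

```python
def estimated_queue_length(tps, await_ms):
    """Little's law: requests in flight = arrival rate x time in system."""
    return tps * await_ms / 1000.0


def estimated_util_percent(tps, svctm_ms):
    """Busy fraction = requests per second x service time per request."""
    return tps * svctm_ms / 1000.0 * 100.0


# (label, tps, avgqu-sz, await in ms, svctm in ms, %util) from the sar rows.
samples = [
    ("11:05:01 cfq", 1970.41, 3.64, 1.85, 0.47, 92.45),
    ("11:10:01 noop", 4507.35, 1.16, 0.26, 0.12, 52.59),
]
for label, tps, queue, await_ms, svctm_ms, util in samples:
    print(
        f"{label}: queue {estimated_queue_length(tps, await_ms):.2f} vs {queue}, "
        f"util {estimated_util_percent(tps, svctm_ms):.1f}% vs {util}%"
    )
```

Both estimates land close to the reported figures, within the rounding of the two-decimal sar columns. Read that way, the device was busy over 90% of the time under CFQ because each request occupied it for around half a millisecond; under noop the per-request service time fell to around 0.12 ms, leaving headroom even at more than double the request rate.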

The change translated into a measurable difference for MySQL, too. The number of queries per second spikes right when the change is made around 11:10. The subsequent drop-off is due to us reverting the change, just to confirm what we were seeing: the noop scheduler appears to be responsible for a 2-3x improvement in queries-per-second on this server.

Rolling out the change

We’d now gathered the information we were interested in: no apparent drawbacks to the noop scheduler on servers with hardware RAID cards, and potential benefits for some other servers if they’re busy.

We made use of our Facter fact to find systems with the appropriate PCI ID, and then set noop as the scheduler in the GRUB config so that it persists across reboots. Too easy!
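We haven’t shown our Puppet code here, so the following is only a sketch of one way such a change could be made. It assumes a GRUB legacy style grub.conf and the `elevator=` kernel boot parameter, which kernels of this era accept for choosing the default scheduler; the function name and the sample config (including its placeholder paths) are illustrative.

```python
def add_elevator(grub_conf_text, scheduler="noop"):
    """Append elevator=<scheduler> to every kernel line that lacks an elevator= option."""
    updated = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        is_kernel_line = bool(words) and words[0] == "kernel"
        if is_kernel_line and not any(w.startswith("elevator=") for w in words):
            line = f"{line.rstrip()} elevator={scheduler}"
        updated.append(line)
    return "\n".join(updated) + "\n"


# Illustrative stanza; the kernel, root device, and initrd names are placeholders.
sample_conf = (
    "title Example Linux\n"
    "\troot (hd0,0)\n"
    "\tkernel /vmlinuz-EXAMPLE ro root=/dev/EXAMPLE quiet\n"
    "\tinitrd /initramfs-EXAMPLE.img\n"
)
print(add_elevator(sample_conf), end="")
```

Skipping lines that already carry an `elevator=` option keeps the function safe to run repeatedly, which matters for anything applied by configuration management.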
