Beyond iostat 1 second samples with DTrace

I am a big fan of “iostat -x” for a first look at how a Solaris/Illumos system’s storage is doing. A lot of third party performance tools (e.g. Zabbix, BMC, SCOM, etc) also use iostat or the same underlying kstats. However, when a system has more than just a few disks/LUNs the stats become hard to read and costly to store, so the temptation is to sample them less frequently. Indeed, it is not unusual for iostats to be sampled at 15 seconds or greater.

Here’s an example of “iostat -x 15” output:

device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
sd40      0.0    1.5    0.0   29.1  0.0  0.0    0.2   0   0
sd41     12.3    0.0 1735.2    0.0  0.0  0.1    7.0   0   9  
sd42     12.2    0.0 1539.1    0.0  0.0  0.1    7.2   0   9  
sd43      4.0   13.6  658.6 2004.9  0.0  0.1    2.9   0   5  
sd44      3.5   14.7  640.9 2126.1  0.0  0.1    3.1   0   6  
sd45      3.3   11.8  599.4 2097.4  0.0  0.1    3.6   0   5  
sd46      3.3   11.1  547.2 2248.4  0.0  0.1    3.8   0   5  
sd47      3.9   11.3  684.9 2248.4  0.0  0.1    4.0   0   6  
sd48      3.3   11.3  582.3 2097.4  0.0  0.1    3.8   0   6  
sd49      5.2    9.2  719.8 2119.8  0.0  0.1    3.7   0   5  
sd50      2.9    9.5  462.9 2119.8  0.0  0.0    3.7   0   5  

There’s plenty of useful information in there. However, the provided kstats do not differentiate between read and write latency. This means, for example, that many writes to a LUN which has a write cache could mask poor read latencies. So, for some time I have been using DTrace to provide IO statistics that look more like this …

    time         r/s     Kr/s     us/r         w/s     Kw/s     us/w
     115          66     7659     4354         257    33077     3535
     130          54     4753     5057         313    41208     4225
     145          81    10056     3986         263    30093     1028
     160          99    12923     4881         302    33146     5404
     175          44     4807     4161         318    41669     1638
     190          31     4017     3390         267    32117     1591
     205          40     4237     3782         267    31671     1186
     220         385    44153     7386         300    37612     9552
     235         325    38326     6620         252    29397     1217
     250          99    13185     6507         415    49924     9394

Here the basic statistics (IOPS, bandwidth and latency), differentiated by direction, are aggregated across a set of LUNs with the addition of a handy timestamp (in seconds). However, with a 15 second interval things are still pretty boring, so here’s how the “160” data looks with 1 second samples …

    time         r/s     Kr/s     us/r         w/s     Kw/s     us/w
     146           7      896     2449         186     3458     5980
     147          13     1416     3815          96      756    48367
     148         127    16319     3666           1        0       13
     149          42     5254     3738           0        0        0
     150           1      128      509           1        0       28
     151          15     1676     2522           4       16       30
     152          22     2816     3645          49      432     6385
     153          48     5920     6314           0        0        0
     154          23     2832    17770        3918   488337    18345
     155         306    37275     6016         138     2404      180
     156         473    62128     9277           0        0        0
     157           7      896      540         143     1631     1678
     158          24     2838     3176           0        0        0
     159          39     4637     4265           0        0        0
     160         347    48820     5523           4      165       67

Some will recognise the signature “breathing” associated with ZFS transaction group writes (i.e. when data from non-synchronous writes is periodically flushed to disk).

The system under test serves files via NFS over a 10Gbps Ethernet, and clients complain of occasional 100-500ms spikes in read latency. However, although the above 1 second samples show a small increase in mean read latency, it doesn’t look like much to worry about.

The system is connected to a few dozen fast drives via a 4x 6Gbps SAS link. The maximum theoretical bandwidth is, therefore, 24Gbps (i.e 3GBps). Again, the 1 second sample showing nearly 500MBps doesn’t seem that significant.

But now let’s go where iostat cannot, zooming in on the “154-144” interval with 10Hz sampling …

    time         r/s     Kr/s     us/r         w/s     Kw/s     us/w
   153.1          10     1280     3875           0        0        0
   153.2           0        0        0           0        0        0
   153.3          10     1280     5506           0        0        0
   153.4           0        0        0           0        0        0
   153.5           0        0        0           0        0        0
   153.6         100    12800     4829        7550   331625     3194
   153.7          10     1280     4082       13240  1993760    22143
   153.8          20     2560   136888        4150  2171420    82941
   153.9          40     5120    13776        1350   286795    73533
   154.0          40     4000     8753       12890    99770     1640
   154.1          80    10240     6460        1140    23800      879
   154.2          30     1450     6748         240      240      923
   154.3          20     2560     7926           0        0        0
   154.4         200    25600    11560           0        0        0
   154.5           0        0        0           0        0        0
   154.6        1420   179665     9317           0        0        0
   154.7        1300   151960     8097           0        0        0
   154.8           0        0        0           0        0        0
   154.9           0        0        0           0        0        0
   155.0          10     1280    10052           0        0        0

And there it is! In just 200ms we see 400MB written (i.e. at over 2GBps), at which point mean read latency peaks at 137ms. So ZFS’s transaction group writes appear to be starving reads, causing latency spikes invisible with 1 second sampling. We will investigate this further in a subsequent blog post.

Yes, there are many ways to have reach this point with DTrace, however this scenario does serve to demonstrate one of them. Perhaps the main point is that 1 second sampling hides far too much from view (and calls into question the usefulness of the much larger intervals favoured by so many).

Leave a Reply

Your email address will not be published. Required fields are marked *