I am a big fan of “iostat -x” for a first look at how a Solaris/Illumos system’s storage is doing. A lot of third-party performance tools (e.g. Zabbix, BMC, SCOM) also use iostat or the same underlying kstats. However, when a system has more than just a few disks/LUNs the stats become hard to read and costly to store, so the temptation is to sample them less frequently. Indeed, it is not unusual for iostats to be sampled at intervals of 15 seconds or more.
Here’s an example of “iostat -x 15” output:
device     r/s   w/s    kr/s    kw/s  wait  actv  svc_t  %w  %b
sd40       0.0   1.5     0.0    29.1   0.0   0.0    0.2   0   0
sd41      12.3   0.0  1735.2     0.0   0.0   0.1    7.0   0   9
sd42      12.2   0.0  1539.1     0.0   0.0   0.1    7.2   0   9
sd43       4.0  13.6   658.6  2004.9   0.0   0.1    2.9   0   5
sd44       3.5  14.7   640.9  2126.1   0.0   0.1    3.1   0   6
sd45       3.3  11.8   599.4  2097.4   0.0   0.1    3.6   0   5
sd46       3.3  11.1   547.2  2248.4   0.0   0.1    3.8   0   5
sd47       3.9  11.3   684.9  2248.4   0.0   0.1    4.0   0   6
sd48       3.3  11.3   582.3  2097.4   0.0   0.1    3.8   0   6
sd49       5.2   9.2   719.8  2119.8   0.0   0.1    3.7   0   5
sd50       2.9   9.5   462.9  2119.8   0.0   0.0    3.7   0   5
There’s plenty of useful information in there. However, the provided kstats do not differentiate between read and write latency. This means, for example, that many writes to a LUN which has a write cache could mask poor read latencies. So, for some time I have been using DTrace to provide IO statistics that look more like this …
time   r/s    Kr/s   us/r   w/s    Kw/s   us/w
 115    66    7659   4354   257   33077   3535
 130    54    4753   5057   313   41208   4225
 145    81   10056   3986   263   30093   1028
 160    99   12923   4881   302   33146   5404
 175    44    4807   4161   318   41669   1638
 190    31    4017   3390   267   32117   1591
 205    40    4237   3782   267   31671   1186
 220   385   44153   7386   300   37612   9552
 235   325   38326   6620   252   29397   1217
 250    99   13185   6507   415   49924   9394
Here the basic statistics (IOPS, bandwidth and latency), differentiated by direction, are aggregated across a set of LUNs with the addition of a handy timestamp (in seconds). However, with a 15 second interval things are still pretty boring, so here’s how the “160” data looks with 1 second samples …
time   r/s    Kr/s   us/r    w/s    Kw/s   us/w
 146     7     896   2449    186    3458   5980
 147    13    1416   3815     96     756  48367
 148   127   16319   3666      1       0     13
 149    42    5254   3738      0       0      0
 150     1     128    509      1       0     28
 151    15    1676   2522      4      16     30
 152    22    2816   3645     49     432   6385
 153    48    5920   6314      0       0      0
 154    23    2832  17770   3918  488337  18345
 155   306   37275   6016    138    2404    180
 156   473   62128   9277      0       0      0
 157     7     896    540    143    1631   1678
 158    24    2838   3176      0       0      0
 159    39    4637   4265      0       0      0
 160   347   48820   5523      4     165     67
Some will recognise the signature “breathing” associated with ZFS transaction group writes (i.e. when data from non-synchronous writes is periodically flushed to disk).
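(As an aside: if you would rather watch those transaction group syncs directly than infer them from the write pattern, something along the lines of the following can work. Treat it purely as a sketch of mine, not something used for the numbers in this post — it leans on the fbt provider and on spa_sync(spa_t *, uint64_t txg) being visible with typed arguments, which can vary between releases.)

#!/usr/sbin/dtrace -s
/* Sketch: report how long each ZFS transaction group sync takes. */

#pragma D option quiet

fbt::spa_sync:entry
{
	self->txg = args[1];		/* the txg number being synced */
	self->ts = timestamp;
}

fbt::spa_sync:return
/self->ts/
{
	printf("%Y  txg %d synced in %d ms\n", walltimestamp,
	    self->txg, (timestamp - self->ts) / 1000000);
	self->txg = 0;
	self->ts = 0;
}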
The system under test serves files via NFS over 10Gbps Ethernet, and clients complain of occasional 100-500ms spikes in read latency. Although the 1 second samples above show a small increase in mean read latency, it doesn’t look like much to worry about.
The system is connected to a few dozen fast drives via a 4x 6Gbps SAS link, giving a maximum theoretical bandwidth of 24Gbps (i.e. 3GBps). Again, the 1 second sample showing nearly 500MBps doesn’t seem that significant.
But now let’s go where iostat cannot, zooming in on the “154-155” interval with 10Hz sampling …
time    r/s    Kr/s    us/r    w/s     Kw/s   us/w
153.1    10    1280    3875      0        0      0
153.2     0       0       0      0        0      0
153.3    10    1280    5506      0        0      0
153.4     0       0       0      0        0      0
153.5     0       0       0      0        0      0
153.6   100   12800    4829   7550   331625   3194
153.7    10    1280    4082  13240  1993760  22143
153.8    20    2560  136888   4150  2171420  82941
153.9    40    5120   13776   1350   286795  73533
154.0    40    4000    8753  12890    99770   1640
154.1    80   10240    6460   1140    23800    879
154.2    30    1450    6748    240      240    923
154.3    20    2560    7926      0        0      0
154.4   200   25600   11560      0        0      0
154.5     0       0       0      0        0      0
154.6  1420  179665    9317      0        0      0
154.7  1300  151960    8097      0        0      0
154.8     0       0       0      0        0      0
154.9     0       0       0      0        0      0
155.0    10    1280   10052      0        0      0
And there it is! In just 200ms we see 400MB written (i.e. at over 2GBps), at which point mean read latency peaks at 137ms. So ZFS’s transaction group writes appear to be starving reads, causing latency spikes invisible with 1 second sampling. We will investigate this further in a subsequent blog post.
Yes, there are many ways to have reached this point with DTrace, but this scenario does serve to demonstrate one of them. Perhaps the main point is that 1 second sampling hides far too much from view (and calls into question the usefulness of the much larger intervals favoured by so many).
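For those wondering how numbers like these can be gathered, here is a minimal sketch in the same spirit. It is not the exact script behind the tables above, and the aggregation names and output format are just illustrative, but it shows the essential idea: use the io provider’s start/done probes, split on B_READ to keep read and write statistics apart, and print one line per tick.

#!/usr/sbin/dtrace -s
/*
 * Illustrative sketch only: per-direction IOPS, bandwidth and mean I/O
 * latency via the io provider, printed once per second. For 10Hz samples
 * change tick-1sec to tick-10hz (and remember that the per-interval
 * counts then need multiplying by 10 to remain per-second rates).
 */

#pragma D option quiet

dtrace:::BEGIN
{
	printf("%6s %8s %8s %6s %8s %8s\n",
	    "r/s", "Kr/s", "us/r", "w/s", "Kw/s", "us/w");
}

io:::start
{
	/* arg0 is the buf pointer -- note when this I/O was issued */
	issued[arg0] = timestamp;
}

io:::done
/issued[arg0] && (args[0]->b_flags & B_READ)/
{
	@rps = count();
	@rkb = sum(args[0]->b_bcount / 1024);
	@rus = avg((timestamp - issued[arg0]) / 1000);
	issued[arg0] = 0;
}

io:::done
/issued[arg0] && !(args[0]->b_flags & B_READ)/
{
	@wps = count();
	@wkb = sum(args[0]->b_bcount / 1024);
	@wus = avg((timestamp - issued[arg0]) / 1000);
	issued[arg0] = 0;
}

profile:::tick-1sec
{
	printa("%@6d %@8d %@8d %@6d %@8d %@8d\n",
	    @rps, @rkb, @rus, @wps, @wkb, @wus);
	trunc(@rps); trunc(@rkb); trunc(@rus);
	trunc(@wps); trunc(@wkb); trunc(@wus);
}

The output shown earlier also carries a timestamp column and is restricted to a particular set of LUNs (the io provider’s args[1]->dev_statname is one handle for that kind of filtering), but the sketch above is enough to convey the approach.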
Hi, thanks a lot for this blog. I’m currently trying to sample Hadoop reads/writes on SATA vs Fusion I/O, and I couldn’t get a clear picture of the read/write bandwidth because the smallest sampling interval is 1 sec (for iostat / sar).
So how do I get the 10Hz sampling interval like you did here with DTrace? It would be awesome if you could give me some pointers.
Thanks,
Shaum
I am very interested in your DTrace I/O report. Would you be willing to share the source code for it? I would like to be able to measure write latency separately from read latency, but iostat will not provide those statistics. Thank you!
Hello!
Could you tell us which DTrace probes you use for your iostat-like script?
Thanks!