libMicro is a portable, scalable microbenchmarking framework which Bart Smaalders and I put together a little while back. It is available to the world via the OpenSolaris website, under the CDDL license, so there really is nothing to stop you recreating the data in this posting.
Although designed for testing individual APIs, the libMicro framework has proven useful for other investigations. For instance, the memrand case, which does negative stride pointer chasing, can be configured to test processor, cache, and memory latencies…
    huron$ bin/memrand -s 128m -B 1000000 -C 10
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   1      0.15232        12        0  1000000 134217728
    huron$
The above shows 12 samples (we asked for at least 10) of 1,000,000 negative stride pointer references striped across 128MB of memory. The platform is a Sun SPARC Enterprise T5220 server, with an UltraSPARC T2 processor running at 1.4GHz. This simple test indicates a memory read latency of 152ns (0.15232 usecs/call).
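To make the unit conversion explicit, here is a small Python sketch that turns the reported usecs/call figure into nanoseconds and, assuming the 1.4GHz clock quoted above, into processor cycles. The variable names are mine, and the cycle count is only a back-of-envelope figure, not something libMicro reports.

```python
# Figures copied from the single-threaded memrand run above.
USECS_PER_CALL = 0.15232   # the "usecs/call" column
CLOCK_HZ = 1.4e9           # assumption: the 1.4GHz clock quoted for the UltraSPARC T2

# One "call" here is one dependent pointer reference, so usecs/call is the read latency.
latency_ns = USECS_PER_CALL * 1000.0
latency_cycles = USECS_PER_CALL * 1e-6 * CLOCK_HZ

print(f"latency: {latency_ns:.0f} ns, about {latency_cycles:.0f} clock cycles")
```

In other words, each read costs a little over 200 clock cycles at this clock rate.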
But we can also use libMicro’s multiprocess and multithread scaling capabilities to extend this test to measure memory throughput scaling…
    huron$ for i in 1 2 4 8 16 32 64; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   1      0.15223        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   2      0.15176        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   4      0.15208        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   8      0.25472        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1  16      0.26242        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1  32      0.24964        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1  64      0.24063        12        0  1000000 134217728
    huron$
This shows that up to 4 concurrent threads see 152ns latency, with 64 threads (i.e. full processor utilisation) seeing 240ns latency, which equates to a throughput of 267 million memory reads per second (i.e. 64 / 0.240e-6). Just to set this in context, here are some data for a quad socket Tigerton system running at 2.93GHz…
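The throughput arithmetic can be applied to every row of the T5220 run, not just the last one. The following Python sketch does that, on the assumption that each thread has exactly one read in flight at a time, so that throughput is simply threads divided by per-read latency. The helper names (`parse_rows`, `reads_per_second`) are mine and are not part of libMicro.

```python
# Data rows copied from the T5220 run above (header lines omitted).
# Columns: name prc thr usecs/call samples errors cnt/samp size
T5220_RUN = """\
memrand 1 1 0.15223 12 0 1000000 134217728
memrand 1 2 0.15176 12 0 1000000 134217728
memrand 1 4 0.15208 12 0 1000000 134217728
memrand 1 8 0.25472 12 0 1000000 134217728
memrand 1 16 0.26242 12 0 1000000 134217728
memrand 1 32 0.24964 12 0 1000000 134217728
memrand 1 64 0.24063 12 0 1000000 134217728
"""


def parse_rows(text):
    """Return a list of (threads, usecs_per_call) tuples from memrand data rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] == "memrand":
            rows.append((int(fields[2]), float(fields[3])))
    return rows


def reads_per_second(threads, usecs_per_call):
    """Aggregate throughput, assuming one outstanding read per thread."""
    return threads / (usecs_per_call * 1e-6)


rows = parse_rows(T5220_RUN)
for threads, usecs in rows:
    mreads = reads_per_second(threads, usecs) / 1e6
    print(f"{threads:3d} threads: {usecs * 1000:6.1f} ns/read, {mreads:7.1f} million reads/s")
```

Using the unrounded 0.24063 usecs/call for 64 threads gives about 266 million reads per second; the 267 million quoted above comes from rounding the latency to 240ns first.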
    tiger$ for i in 1 2 4 8 16; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   1      0.15559        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   2      0.15621        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   4      0.15667        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1   8      0.17726        12        0  1000000 134217728
                 prc thr   usecs/call   samples   errors cnt/samp      size
    memrand        1  16      0.18654        12        0  1000000 134217728
    tiger$
This shows a peak throughput of about 86 million memory reads per second (i.e. 16 / 0.186e-6), making the single-chip UltraSPARC T2 processor’s throughput about 3x that of its quad-chip rival. Of course, mileage will vary greatly from workload to workload, but pretty impressive nonetheless, heh?
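For anyone who wants to check the 3x figure, here is a minimal Python sketch of the comparison. It uses the highest thread count measured on each system (64 on the T5220, 16 on the Tigerton box) and the same threads-divided-by-latency assumption as before. The function and variable names are hypothetical.

```python
def peak_reads_per_second(threads, usecs_per_call):
    """Aggregate reads/s, assuming each thread has one read in flight at a time."""
    return threads / (usecs_per_call * 1e-6)


# Last row of each run above: (threads, usecs/call).
t2_peak = peak_reads_per_second(64, 0.24063)        # T5220, UltraSPARC T2
tigerton_peak = peak_reads_per_second(16, 0.18654)  # quad socket Tigerton

ratio = t2_peak / tigerton_peak

print(f"T2:       {t2_peak / 1e6:.0f} million reads/s")
print(f"Tigerton: {tigerton_peak / 1e6:.0f} million reads/s")
print(f"ratio:    {ratio:.1f}x")
```

The comparison only covers the thread counts actually run on each machine, so it says nothing about how either system behaves beyond those points.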