Niagara 2 memory throughput according to libMicro

libMicro is a portable, scalable microbenchmarking framework which Bart Smaalders and I put together a little while back. It is available to the world via the OpenSolaris website, under the CDDL license, so there really is nothing to stop you recreating the data in this posting.

Although designed for testing individual APIs, the libMicro framework has proven useful for other investigations. For instance, the memrand case, which does negative stride pointer chasing, can be configured to test processor cache and memory latencies…

huron$ bin/memrand -s 128m -B 1000000 -C 10
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15232           12        0  1000000 134217728
huron$

The above shows 12 samples (we asked for at least 10) of 1,000,000 negative stride pointer references striped across 128MB of memory. The platform is a Sun SPARC Enterprise T5220 server, with an UltraSPARC T2 processor running at 1.4GHz. This simple test indicates a memory read latency of 152ns.
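
As an aside, here is a minimal sketch in C of what a dependent-load pointer chase looks like. This is just an illustration of the technique, not libMicro's actual memrand code: a chain of pointers is threaded through a 128MB buffer in randomised order, so each load depends on the previous one and hardware prefetching is defeated, and the average time per load then approximates the memory read latency reported above.

/*
 * Illustrative pointer-chase latency sketch (not libMicro's memrand).
 * A chain of pointers is threaded through a large buffer in random
 * order; each load depends on the previous one, so the average time
 * per iteration approximates memory read latency.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (128UL * 1024 * 1024)    /* 128MB working set */
#define NLOADS    1000000UL                /* dependent loads timed */

int
main(void)
{
    size_t ncells = BUF_BYTES / sizeof (void *);
    void **buf = malloc(ncells * sizeof (void *));
    size_t *order = malloc(ncells * sizeof (size_t));
    size_t i;

    if (buf == NULL || order == NULL)
        return (1);

    /* Build a (rough) random permutation of the cell indices. */
    for (i = 0; i < ncells; i++)
        order[i] = i;
    srand(1);
    for (i = ncells - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }

    /* Link each cell to the next one in the permutation. */
    for (i = 0; i < ncells - 1; i++)
        buf[order[i]] = (void *)&buf[order[i + 1]];
    buf[order[ncells - 1]] = (void *)&buf[order[0]];

    /* Chase the chain; every load depends on the one before it. */
    struct timespec t0, t1;
    void **p = &buf[order[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NLOADS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);

    /* Print p so the compiler cannot discard the loop. */
    printf("%.1f ns/load (final %p)\n", ns / NLOADS, (void *)p);

    free(buf);
    free(order);
    return (0);
}

On comparable hardware this should land in the same ballpark as the 152ns figure above, though exact agreement with memrand isn't to be expected, since the access pattern and timing harness differ.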

But we can also use libMicro’s multiprocess and multithread scaling capabilities to extend this test to measure memory throughput scaling…

huron$ for i in 1 2 4 8 16 32 64; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15223           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   2      0.15176           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   4      0.15208           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   8      0.25472           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  16      0.26242           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  32      0.24964           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  64      0.24063           12        0  1000000 134217728
huron$

This shows that up to 4 concurrent threads see 152ns latency, with 64 threads (i.e. full processor utilisation) seeing 240ns latency, which equates to a throughput of 267 million memory reads per second (i.e. 64 / 0.240e-6). Just to set this in context, here are some data for a quad socket Tigerton system running at 2.93GHz…

tiger$ for i in 1 2 4 8 16; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15559           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   2      0.15621           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   4      0.15667           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   8      0.17726           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  16      0.18654           12        0  1000000 134217728
tiger$

This shows a peak throughput of about 86 million memory reads per second (i.e. 16 / 0.186e-6), making the single-chip UltraSPARC T2 processor's throughput 3x that of its quad-chip rival. Of course, mileage will vary greatly from workload to workload, but pretty impressive nonetheless, eh?
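
For completeness, the throughput figures quoted above fall straight out of the usecs/call numbers: aggregate reads per second is simply threads / (usecs per call x 1e-6), or threads divided by usecs per call in millions. A trivial snippet doing that arithmetic for the fully loaded runs, with the values copied from the output above:

/*
 * Aggregate throughput = threads / (usecs_per_call * 1e-6) reads/s,
 * i.e. threads / usecs_per_call million reads/s.  The usecs/call
 * figures are the fully loaded cases from the runs above.
 */
#include <stdio.h>

int
main(void)
{
    struct { const char *sys; int thr; double usecs; } runs[] = {
        { "huron (T5220, 64 threads)", 64, 0.24063 },
        { "tiger (Tigerton, 16 threads)", 16, 0.18654 },
    };
    int i;

    for (i = 0; i < 2; i++)
        printf("%s: %.0f million reads/s\n", runs[i].sys,
            runs[i].thr / runs[i].usecs);
    return (0);
}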
