libMicro: we scare because we care

In some ways, libMicro was a reaction to LMbench (which Bart Smaalders and I considered unscientific and a pain in the neck), but we really wanted to write a useful tool which could produce compelling data to drive improvements in Solaris. The result has exceeded our expectations dramatically. Not only has libMicro produced data for many “Linux is faster than Solaris at xxx” bugs, but it also kick-started Sun’s interest in the AMD Opteron processor (as well as helping the adoption of SPARC64).

libMicro also has the distinction of being one of the first open source projects hosted under Mercurial on the opensolaris.org collaboration website. It is still used extensively within Sun, and the code has proven to be a useful reference for those wanting to write multithreaded applications. Today libMicro can be found alive and well here, and even our competitors are using it!

PRISM and the patent

Before Solaris could have large page support for program text and data, we needed a business case. PRISM stands for Process Relocation in Intimate Shared Memory, and was my first big innovation whilst in PAE. The idea is simple: stop the process, copy a region of small pages somewhere, unmap the source region, remap the source region with large pages, copy the data back, and then allow the process to continue. At the time ISM was the only source of large pages.
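
For the curious, here is a minimal sketch of that sequence in C. It is only an illustration with hypothetical names, not the real tool: the real PRISM operated on a stopped target process, and I am assuming ISM obtained via shmat() with SHM_SHARE_MMU, a page-aligned region, and no error handling.

    /* prism_relocate: a hypothetical sketch of the PRISM sequence.
     * Assumes the target is already stopped, addr/len are page-aligned,
     * and that ISM (shmat with SHM_SHARE_MMU) supplies the large pages.
     */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <stdlib.h>
    #include <string.h>

    void
    prism_relocate(void *addr, size_t len)
    {
        void *save = malloc(len);
        int id;

        (void) memcpy(save, addr, len);         /* copy the small pages away  */
        (void) munmap(addr, len);               /* unmap the source region    */
        id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
        (void) shmat(id, addr, SHM_SHARE_MMU);  /* remap with ISM large pages */
        (void) memcpy(addr, save, len);         /* copy the data back         */
        free(save);                             /* ... and let it run again   */
    }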

My first solution used the LD_PRELOAD shared library interposition technique, but I quickly moved on to LD_AUDIT interposition because it provides more fine-grained control. Operating at process startup (optionally with a dummy malloc() and free() to preallocate the heap before the relocation took place), PRISM generated plenty of useful data to fuel the MPSS and Large Pages OOB projects. It also highlighted the usefulness of local copies of read-only text and data for large-scale NUMA machines.
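
To give a flavour of the LD_AUDIT approach, here is a minimal audit library skeleton (again a sketch, not the PRISM source): la_version() is the handshake with the runtime linker, and la_preinit() fires once everything is mapped but before main() gains control, a natural place to do this kind of work.

    /* audit.c: build with "cc -G -Kpic -o audit.so audit.c" and run the
     * target with LD_AUDIT=./audit.so (a skeleton, not the real PRISM code).
     */
    #include <sys/types.h>
    #include <link.h>

    uint_t
    la_version(uint_t version)
    {
        return (LAV_CURRENT);   /* handshake with the runtime linker */
    }

    void
    la_preinit(uintptr_t *cookie)
    {
        /* all objects are mapped but main() has not yet run:
         * preallocate the heap and relocate pages here */
    }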

The PRISM library improved some of our published CPU benchmark numbers, and so had to be shipped with some versions of our compilers. This triggered the patent filing process, and my patent was finally awarded a year or so later.

About five years later, with MPSS and Large Pages OOB in place, I revisited the PRISM idea with Shatter, a tool to break up large pages into smaller ones. This contributed to Nicolai Kosce’s dataspace profiling initiative (DProfile), which was trying to understand the effect of page colouring on performance.

A brief history of threads

Before joining PAE (Performance and Availability Engineering), I worked with a major European database vendor on their kernel scalability (on behalf of a mutual customer, a leading media company). We were fighting limitations in an aging implementation of Sun’s pioneering two-level thread model (something which became known as “old and broken libthread”). During one of my OS Ambassador trips, I visited Bryan Cantrill and Roger Faulkner, and discovered that Bryan had sketched and Roger had prototyped a new implementation based on a one-level model. I then used the customer as the business case for introducing the one-level “alternate” implementation in Solaris 8 (under /usr/lib/lwp).

By the time I joined PAE, the new implementation had gained quite a reputation for fixing scalability and stability problems with many multithreaded applications. PAE had many fans of the two-level concept, so I found myself immediately in conflict with some of my new colleagues. But I stuck to my guns and was able to win most of them over to the one-level model. I then worked with Roger, Bart Smaalders and others to make the one-level model the only implementation in Solaris 9. Part of my contribution to this effort was to write the technical whitepaper Multithreading in the Solaris Operating Environment:

  • The original version on www.sun.com [pdf]
  • The revised version as presented at SUPerG [pdf]

This paper has become a widely quoted description of how we do multithreading, and is still relevant today. Of course, the new thread implementation paved the way for Roger’s 1600 file putback to unify the Solaris process model, making threads first class citizens in Solaris – something Linux may actually never achieve!

Education

I don’t consider myself an academic, but I did manage a “Desmond” honours degree in Microelectronics and Computing from the University of Wales, Aberystwyth. It was there that I first fell in love with BSD UNIX (on a VAX 11/750), and there that I saw my first Sun workstation (a 2/120, although I was never allowed to use it).

I have since maintained an active interest in the education market, because I feel it is a natural recruiting ground for future Sun employees and customers. Indeed, I have recruited at least three people into Sun from Aberystwyth alone. Over the years I have worked with the Universities of Aberystwyth, Bangor, Bradford, Dundee, Durham, Leeds, Liverpool, Manchester, Oxford, St Andrews, Salford, Warwick and York, most recently on Sysadmin day conferences in Aberystwyth and Manchester.

Job rotations

An invaluable part of my personal development at Sun has been the SE Job Rotation programme, which was run by Barbara Hill (or “Mom” as she was affectionately known by many of us “on rotation”) …

  • Siebel scalability and load balancing (MDE, 2 weeks)
  • Oracle 7 on Solaris x86 and Windows NT (OPG, 2 weeks)
  • BaaN scalability (MDE, 9 weeks)
  • Many users project (PAE, 4 weeks)

These job rotations gave me the opportunity to learn alongside thought leaders such as Adrian Cockcroft, Allan Packer, Bob Larson, Brian Wong, Dan Powers, Jim Mauro, Mike Briggs and Richard McDougall. They also gave me my first real contact with Solaris engineering, and paved the way for some of the jobs I’ve moved through over the years.

The GORB (Giga Object Request Broker)

One of the first solutions designed to make full use of the 64 CPUs and 64GB of the Enterprise 10000 … in a single process … using threads. This was a collaborative R&D project with a large telco, which resulted in at least one patent being filed. My role was to deliver the multithreading expertise needed to make this fly.

Solving solitaire with a sledgehammer

I’m only including this example because it was so outrageous! A government agency was investigating various hardware platforms for (what I guess were) HPC cryptographic applications. Obviously, they were not sharing any of their actual code; instead they specified a number of number-crunching challenges for large scale multiprocessors. I took up the Solitaire challenge with a 16-way E6000 and more than 8GB of RAM.

    o o o           * * *           o o o           5 6 7          5 6 7
    o o o           * * *           o o o           2 3 4          2 3 4
o o o o o o o   * * * * * * *   o o o o o o o       0 1        7 4 0 1 0 2 5
o o o o o o o   * * * o * * *   o o o * o o o                  6 3 1 x 1 3 6
o o o o o o o   * * * * * * *   o o o o o o o                  5 2 0 1 0 4 7
    o o o           * * *           o o o                          4 3 2
    o o o           * * *           o o o                          7 6 5
    Fig.1           Fig.2           Fig.3           Fig.4          Fig.5

In Figs.1-3 an “o” represents a hole, and a “*” a marble in a hole. Thus, Fig.1 shows the empty board, Fig.2 the starting position (32 marbles), and Fig.3 the target end position. My solution was to encode each board position as 33 bits, with 0 for a hole and 1 for a marble. Threads in a worker pool took known board positions from a work pile and found all possible new moves, which were then added to the work pile. Exploiting rotational and reflectional symmetry, each new board position becomes up to eight possible board positions.

By coding the 33 bits as shown in Figs.4-5 I was able to make rotation a simple byte swap. Reflection is harder, but only needs to be done once (since the remaining three reflections can be achieved by rotating the first reflection). The really extravagant part was adding an 8GB char array to record board positions that had already been seen (and which therefore did not need to be explored again).
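
Here is my reconstruction (not the original code) of what that encoding might look like in C, assuming the four arms of the board occupy one byte each, numbered as in Figs.4-5, with the centre hole held separately:

    #include <stdint.h>
    #include <stdlib.h>

    /* Four 8-hole "arms" packed one per byte (bits numbered as in Fig.4),
     * plus the centre hole: 33 bits per board position in total. */
    typedef struct {
        uint32_t arms;      /* arm k lives in byte k of this word */
        uint32_t centre;    /* the 'x' in Fig.5: 1 = marble, 0 = hole */
    } board_t;

    /* Rotating the board 90 degrees moves each arm to the next byte,
     * so rotation is just a byte-wise rotate of the arm word. */
    board_t
    rotate90(board_t b)
    {
        b.arms = (b.arms << 8) | (b.arms >> 24);
        return (b);
    }

    /* One byte per possible 33-bit position: 2^33 bytes = 8GB. */
    unsigned char *seen;

    int
    mark_seen(board_t b)
    {
        uint64_t key = ((uint64_t)b.centre << 32) | b.arms;
        int old = seen[key];

        seen[key] = 1;
        return (old);       /* non-zero: already explored */
    }

The seen array would be allocated once with calloc(1ULL << 33, 1) – exactly the extravagant 8GB.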

The result was a solution in just 2 seconds, with all possible board positions being found within 30 seconds. Sun hardware and my expertise in multithreading have moved on a lot in the intervening years. I’m now itching to try the exercise again on a T5220!

The WOT (Wall Of Terminals)

The Wall Of Terminals was a high-tech response to something a competitor was doing in their benchmarking centre with 50 dumb terminals, namely showing the state of simulated users during large scale multiuser testing.

My design consisted of the following hardware:

  • a custom built piece of furniture
  • six dual-headed SPARCstation 5 workstations
  • two triple-headed SPARCstation 10 workstations
  • eighteen premium 21 inch Sun monitors

and the following software:

  • scripts for building cloned diskless boot environments for the eight workstations
  • a CDE application which multiplexed up to 48 DtTerm widgets in one window
  • a multithreaded application routing up to 864 pseudo terminals across 18 instances of the above
  • a Java applet to reconfigure the number of DtTerm widgets displayed on the fly, and to select one to zoom

The WOT was instrumental in winning a huge MRP deal in the aerospace industry, but also proved very useful as a collection of eighteen X11 screens for displaying just about any benchmark data. My WOT also won an innovation award at the second Sun Technical Symposium in San Francisco.

Oracle, poll() and nanosleep()

Whilst on my first SE Job Rotation to Mountain View, I noticed that Oracle was using poll(0,0,10) as a “cheap” sleep mechanism (it is, after all, cheaper than the usleep() SIGALRM dance). However, at that time Solaris had a clunky implementation of poll() which did not scale well (see below). My solution was to write an interposing shared library (something very new in those days) which mapped poll() onto nanosleep() for the sleep-only case. We told our Oracle engineering contacts what I had done, but heard nothing. About a year later I was running truss on Oracle, and noticed that they had implemented my recommendation.
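
The interposer itself was only a few lines. The following is a sketch from memory rather than the original source: when poll() is asked to do nothing but sleep, call nanosleep() instead; anything else is handed off to the real poll() via dlsym(RTLD_NEXT, ...).

    /* fastpoll.c: build with "cc -G -Kpic -o fastpoll.so fastpoll.c"
     * and run the target with LD_PRELOAD=./fastpoll.so */
    #include <poll.h>
    #include <time.h>
    #include <dlfcn.h>

    int
    poll(struct pollfd *fds, nfds_t nfds, int timeout)
    {
        static int (*real_poll)(struct pollfd *, nfds_t, int);

        if (nfds == 0 && timeout > 0) {         /* the sleep-only case */
            struct timespec ts;

            ts.tv_sec = timeout / 1000;
            ts.tv_nsec = (timeout % 1000) * 1000000L;
            return (nanosleep(&ts, NULL));      /* no fd scanning at all */
        }

        if (real_poll == NULL)                  /* anything else: pass through */
            real_poll = (int (*)(struct pollfd *, nfds_t, int))
                dlsym(RTLD_NEXT, "poll");
        return (real_poll(fds, nfds, timeout));
    }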

mt.telnetd

One of the early challenges for systems like the SPARCcenter 2000 was that Solaris only supported 48 TELNET users by default. But once this limit was removed, Solaris would die from heavy lock contention triggered by in.telnetd in poll(). Whilst on my first SE Job Rotation, it seemed to me that the obvious solution was to recode the TELNET service using a STREAMS module. At the time I didn’t have the kernel skills to do this, but I knew someone who did. However, they needed a business case to do the work. So I implemented a poll()-free TELNET service using the newly available user-level threads library (i.e. one process handling hundreds of TELNET sessions), as sketched below. My data provided the business case, and in.telnetd was given its STREAMS implementation.
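
The core of my prototype was nothing more exotic than a pair of relay threads per session, each blocking in read() rather than sitting in poll(). The sketch below is a reconstruction from memory using the Solaris threads API, with names of my own invention:

    #include <thread.h>     /* Solaris threads: link with -lthread */
    #include <stdlib.h>
    #include <unistd.h>

    typedef struct {
        int from_fd;
        int to_fd;
    } relay_t;

    /* Block in read() (not poll()) and copy data straight through. */
    void *
    relay(void *arg)
    {
        relay_t *r = arg;
        char buf[1024];
        ssize_t n;

        while ((n = read(r->from_fd, buf, sizeof (buf))) > 0)
            if (write(r->to_fd, buf, n) != n)
                break;
        free(r);
        return (NULL);
    }

    /* One detached relay thread per direction per TELNET session lets a
     * single process carry hundreds of sessions with no poll() at all. */
    void
    start_session(int sock_fd, int pty_fd)
    {
        relay_t *in = malloc(sizeof (*in));
        relay_t *out = malloc(sizeof (*out));

        in->from_fd = sock_fd;
        in->to_fd = pty_fd;
        out->from_fd = pty_fd;
        out->to_fd = sock_fd;
        (void) thr_create(NULL, 0, relay, in, THR_DETACHED, NULL);
        (void) thr_create(NULL, 0, relay, out, THR_DETACHED, NULL);
    }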