I don’t consider myself an academic, but I did manage a “Desmond” honours degree in Microelectronics and Computing from the University of Wales, Aberystwyth. It was there that I first fell in love with BSD UNIX (on a VAX 11/750), and there that I saw my first Sun workstation (a 2/120, although I was never allowed to use it).
I have since maintained an active interest in the education market, because I feel it is a natural recruiting ground for future Sun employees and customers. Indeed, I have recruited at least three people into Sun from Aberystwyth alone. Over the years I have worked with the Universities of Aberystwyth, Bangor, Bradford, Dundee, Durham, Leeds, Liverpool, Manchester, Oxford, St Andrews, Salford, Warwick and York, most recently on Sysadmin day conferences in Aberystwyth and Manchester.
An invaluable part of my personal development at Sun has been the SE Job Rotation programme, which was run by Barbara Hill (or “Mom” as she was affectionately known by many of us “on rotation”) …
- Siebel scalability and load balancing (MDE, 2 weeks)
- Oracle 7 on Solaris x86 and Windows NT (OPG, 2 weeks)
- BaaN scalability (MDE, 9 weeks)
- Many users project (PAE, 4 weeks)
These job rotations gave me the opportunity to learn alongside thought leaders such as Adrian Cockcroft, Allan Packer, Bob Larson, Brian Wong, Dan Powers, Jim Mauro, Mike Briggs and Richard McDougall. They also gave me my first real contact with Solaris engineering, and paved the way for some of the jobs I’ve moved through over the years.
One of the first solutions designed to make full use of the 64 CPUs and 64GB of the Enterprise 10000 … in a single process … using threads. This was a collaborative R&D project with a large telco, which resulted in at least one patent being filed. My role was to deliver the multithreading expertise needed to make this fly.
I’m only including this example because it was so outrageous! A government agency was investigating various hardware platforms for (what I guess was) HPC cryptographic applications. Obviously, they were not sharing any of their actual code, but instead specified several number-crunching challenges for large-scale multiprocessors. I took up the challenge of Solitaire with a 16-way E6000 and more than 8GB of RAM.
    o o o            * * *            o o o            5 6 7            5 6 7
    o o o            * * *            o o o            2 3 4            2 3 4
o o o o o o o    * * * * * * *    o o o o o o o        0 1          7 4 0 1 0 2 5
o o o o o o o    * * * o * * *    o o o * o o o                     6 3 1 x 1 3 6
o o o o o o o    * * * * * * *    o o o o o o o                     5 2 0 1 0 4 7
    o o o            * * *            o o o                             4 3 2
    o o o            * * *            o o o                             7 6 5
    Fig.1            Fig.2            Fig.3            Fig.4            Fig.5
In Figs.1-3 an “o” represents a hole, and a “*” a marble in a hole. Thus, Fig.1 shows the empty board, Fig.2 the starting position (32 marbles), and Fig.3 the target end position. My solution was to encode each board position as 33 bits, with 0 for a hole and 1 for a marble. Threads in a worker pool took known board positions from a work pile and found all possible new moves, which were then added to the work pile. Exploiting rotational and reflectional symmetry, each new board position becomes up to eight possible board positions.
By coding the 33 bits as shown in Figs.4-5 I was able to make rotation a simple byte swap. Reflection is harder, but only needs to be done once (since the remaining three reflections can be achieved by rotating the first reflection). The really extravagant part was adding an 8GB char array to record board positions that had already been seen (and which, therefore, didn’t need to be explored again).
The result was a solution in just 2 seconds, with all possible board positions being found within 30 seconds. Sun hardware and my expertise in multithreading have moved on a lot in the intervening years. I’m now itching to try the exercise again on a T5220!
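For the curious, here is a minimal sketch in C of the encoding idea from Figs.4-5. It is not the original code. I am assuming that the four 8-bit arms sit in bits 0-31 in clockwise order (top, right, bottom, left) and that the centre hole is bit 32; the function names (cell_to_bit, rotate90, reflect, canonical) are made up for illustration. With that layout a 90 degree rotation is just a byte rotation of the low 32 bits, while the one reflection takes the slow route through the coordinate table.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0-31 are four 8-bit arms (top, right, bottom, left);
   bit 32 is the centre hole. */
#define CENTRE_BIT      (UINT64_C(1) << 32)
#define START_POSITION  UINT64_C(0xFFFFFFFF)   /* Fig.2: 32 marbles */
#define TARGET_POSITION CENTRE_BIT             /* Fig.3: centre only */

/* (row, col) of bits 0..7 in the top arm, read off Fig.4 (0-indexed). */
static const int top_arm[8][2] = {
    {2, 2}, {2, 3},            /* bits 0 1   */
    {1, 2}, {1, 3}, {1, 4},    /* bits 2 3 4 */
    {0, 2}, {0, 3}, {0, 4},    /* bits 5 6 7 */
};

/* Bit index (0..32) for grid cell (r, c), or -1 if the cell is off-board. */
int cell_to_bit(int r, int c)
{
    if (r == 3 && c == 3)
        return 32;
    for (int arm = 0; arm < 4; arm++) {
        for (int k = 0; k < 8; k++) {
            int rr = top_arm[k][0], cc = top_arm[k][1];
            for (int t = 0; t < arm; t++) {   /* turn 90 degrees clockwise */
                int tmp = rr;
                rr = cc;
                cc = 6 - tmp;
            }
            if (rr == r && cc == c)
                return 8 * arm + k;
        }
    }
    return -1;
}

/* Rotate the board 90 degrees clockwise: each arm moves to the next byte. */
uint64_t rotate90(uint64_t b)
{
    uint32_t arms = (uint32_t)(b & UINT64_C(0xFFFFFFFF));
    arms = (arms << 8) | (arms >> 24);
    return (b & CENTRE_BIT) | arms;
}

/* Mirror the board left-to-right, cell by cell (the slow path). */
uint64_t reflect(uint64_t b)
{
    uint64_t out = 0;
    for (int r = 0; r < 7; r++) {
        for (int c = 0; c < 7; c++) {
            int src = cell_to_bit(r, c);
            if (src >= 0 && ((b >> src) & 1)) {
                int dst = cell_to_bit(r, 6 - c);
                assert(dst >= 0);             /* the board is symmetric */
                out |= UINT64_C(1) << dst;
            }
        }
    }
    return out;
}

/* Smallest of the (up to) eight symmetric variants of a position. */
uint64_t canonical(uint64_t b)
{
    uint64_t best = b, m = reflect(b);
    for (int i = 0; i < 4; i++) {
        if (b < best) best = b;
        if (m < best) best = m;
        b = rotate90(b);
        m = rotate90(m);
    }
    return best;
}
```

Because a position is just a 33-bit number, it can be used directly as an index into a table of 2^33 one-byte entries, which is where the 8GB comes from. Reducing each position to one canonical variant before the lookup is only one way to exploit the symmetry; marking all eight variants as seen would work too.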
The Wall Of Terminals was a high-tech response to something a competitor was doing in their benchmarking centre with 50 dumb terminals, namely showing the state of simulated users during large scale multiuser testing.
My design consisted of the following hardware:
- a custom built piece of furniture
- six dual-headed SPARCstation 5 workstations
- two triple-headed SPARCstation 10 workstations
- eighteen premium 21 inch Sun monitors
and the following software:
- scripts for building cloned diskless boot environments for the eight workstations
- a CDE application which multiplexed up to 48 DtTerm widgets in one window
- a multithreaded application routing up to 864 pseudo terminals across 18 instances of the above
- a Java applet to reconfigure the number of DtTerm widgets displayed on the fly, and to select one to zoom
The WOT was instrumental in winning a huge MRP deal in the aerospace industry, but also proved very useful as a collection of eighteen X11 screens for displaying just about any benchmark data. My WOT also won an innovation award at the second Sun Technical Symposium in San Francisco.
Whilst on my first SE Job Rotation to Mountain View, I noticed that Oracle was using poll(0,0,10) as a “cheap” sleep mechanism (it is, after all, cheaper than the usleep() SIGALRM dance). However, at that time Solaris had a clunky implementation of poll() which did not scale well (see below). My solution was to write an interposing shared library (something very new in those days) which mapped poll() onto nanosleep() for the sleep-only case. We told our Oracle engineering contacts about what I had done, but heard nothing. About a year later I was running truss on Oracle, and noticed that they had implemented my recommendation.
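The interposing trick can be sketched roughly as follows. This is a reconstruction of the idea rather than the library I wrote at the time: it assumes a dynamic linker that supports dlsym() with RTLD_NEXT, and the helper ms_to_timespec and the poll_fn typedef are names invented for this example.

```c
#define _GNU_SOURCE            /* needed for RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <poll.h>
#include <time.h>

typedef int (*poll_fn)(struct pollfd *, nfds_t, int);

/* Convert a millisecond timeout into the timespec nanosleep() expects. */
static struct timespec ms_to_timespec(int ms)
{
    struct timespec ts;
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    return ts;
}

/* Same name and signature as the system poll(), so that a preloaded copy
   of this library is found first when the application calls poll(). */
int poll(struct pollfd *fds, nfds_t nfds, int timeout)
{
    static poll_fn real_poll;

    if (nfds == 0 && timeout >= 0) {
        /* Sleep-only case: nothing to watch, so just sleep.  Like poll(),
           nanosleep() returns 0 on expiry and -1 (EINTR) if interrupted. */
        struct timespec ts = ms_to_timespec(timeout);
        return nanosleep(&ts, NULL);
    }

    /* Everything else goes to the next poll() in the search order. */
    if (real_poll == NULL) {
        real_poll = (poll_fn)dlsym(RTLD_NEXT, "poll");
        assert(real_poll != NULL);
    }
    return real_poll(fds, nfds, timeout);
}
```

Built as a shared object and named in LD_PRELOAD when the application is started, something like this changes the behaviour of poll(0,0,10) without touching the application binary. Calls with real file descriptors, or with an infinite timeout, pass straight through.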
One of the early challenges for systems like the SPARCcenter2000 was that Solaris only supported 48 TELNET users by default. But as this limit was removed, Solaris would die from heavy lock contention triggered by in.telnetd in poll(). Whilst on my first SE Job Rotation, it seemed to me that the obvious solution was to recode the TELNET service using a STREAMS module. At the time I didn’t have the kernel skills to do this, but I knew someone who did. However, they needed a business case to do the work. I implemented a poll()-free TELNET service using the newly available user-level threads library (i.e. one process handling hundreds of TELNET sessions). My data provided the business case, and in.telnetd was given its STREAMS implementation.
The second SPARCcenter2000 in the UK went to Liverpool University. They had the full complement of 20 CPUs, and were running Solaris 2.2 (which was not certified beyond 8 CPUs). Their acceptance tests consisted of simulated TELNET traffic from a number of PCs replaying various shell user scenarios (ls, vi, cc, f77, etc). Needless to say, the system didn’t stay up for more than about 10 minutes at a time. I ported their test harness to EMPOWER so that it could be used by engineering, and it is still used to this day.
Engineering insurers National Vulcan placed a large order for IPX workstations. I devised a simple but elegant system for installing and configuring the systems that leveraged Solaris 2’s JumpStart and NIS netgroups (the clever bit is that the target system was Solaris 1). This was then also rolled out to a number of other customers (including some universities).
Over the years I was a regular speaker at Sun User Group, and wrote several book reviews for their magazine. My first real speaking engagement ever was a 2 hour slot introducing Solaris 2.0 to system administrators at a joint SUG / UKUUG meeting at Oxford University. I’m not sure what scared me most: the 200+ people in the room, or the fact that Rob Gingell sat through it all, grinning. This launched my public speaking career (and I still prefer to have at least 2 hours)! I have been back to Oxford University many times since, but nothing quite compares with that first “baptism of fire”.