If Linux is faster, it’s a Solaris bug!

Well that’s what I said back in 2002 — and it certainly had an impact within Sun — but I see Darrin has toned it down a little in his inspirational list of performance mantras. Somehow “if a competing OS is faster, it’s a Solaris bug and it’s important” doesn’t have quite the same punch!

I still stand behind my original call to action. And I’m glad to report that the gap has been closing and continues to close. Of course there will always be cases where the RAS (reliability, availability, and serviceability) or scalability of one platform makes it slower than a competitive but less capable platform. However, the days of attributing all performance vices to other mitigating virtues are long gone. Quite often code is slow simply because it is slow and needs tuning.

I’ve already promised to blog about my lockless scalability fix for getenv(3C) (and I will do so soon). This is just one instance of me putting my putback where my slogan is, although in this case I could have taken the moral high ground: the other platform’s implementation was only faster because it was completely multithread-unsafe. It may still be so for all I know, but I do know that Solaris 10 now has a fast and safe implementation.

Is it too cheeky to suggest a corollary? …

“If Solaris has a cool distinguishing feature, it’s a Linux bug!”


Time, Timers, Timeouts and Ticks – Part 1

One annoying thing about DTrace is that there is now generally no good excuse for not finding out what really is going on whenever a puzzling system statistic rears its ugly head. In the past I would just shrug off such gaps in my understanding and move on to something “more important” (i.e. less soul destroying). With the old tools the “return on investment” was often just too low, but with DTrace the ROI really can be significant: not only is my understanding improving, but I’m finding new stuff to “go fix” all the time!

Jon Haslam and I were sipping coffee in a Washington DC hotel lobby, hacking together some examples for our demo session at the SUPerG 2005 conference. I was simulating a thundering herd scenario on my Acer Ferrari laptop by having 100 threads in a process doing this:

for (;;)
        poll(0, 0, 10);
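
For completeness, here is a self-contained version of that little test. It is a quick sketch rather than the exact program I used on the day, so treat the structure and names as illustrative; the 100 threads and the 10ms poll(2) timeout are the only details that matter:

/*
 * Thundering herd simulation: NTHREADS threads all sleeping in
 * poll(2) with a 10ms timeout, forever.  Watch the syscall and
 * context switch columns of vmstat(1M) while it runs.
 */
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS 100

static void *
herd_member(void *arg)
{
        for (;;)
                (void) poll(0, 0, 10);
        /* NOTREACHED */
        return (NULL);
}

int
main(void)
{
        pthread_t tid;
        int i;

        for (i = 0; i < NTHREADS; i++) {
                if (pthread_create(&tid, NULL, herd_member, NULL) != 0) {
                        perror("pthread_create");
                        exit(1);
                }
        }
        (void) pause();         /* park the main thread */
        return (0);
}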

I’d done this kind of thing many times before, but this time I happened to notice that I wasn’t getting my expected number of syscalls or context switches, as reported by good old vmstat(1M). By some strange quirk I was seeing only about 6,000 additional syscalls per second, but context switches were up by 10,000 per second (where I naively expected the syscalls to be). Of course, I now realise that I should have expected no more than 5,000 poll(2) syscalls per second.

I will presume the reader is pretty used to the idea that Solaris rounds most timers and timeouts (I’m not going to cover CLOCK_HIGHRES timers here) up to the next clock tick. The default clock frequency is 100Hz (and has been as long as I can remember, although from Solaris 7 onwards this has been configurable), which equates to a tick interval of 10ms.

This means that the following code segment running in a single thread will generate no more than 100 poll(2) syscalls per second:

for (;;)
        poll(0, 0, 9);

Although we’re asking for 9ms timeouts, these will get rounded up to the next 10ms tick. On a quiet system we can expect the first call to synchronise us with the clock after one or two ticks. From then on all subsequent calls should only require one tick each (with nearly 1ms to spare for call and implementation overhead).

Likewise we can fairly safely predict that the following code segment will generate no more than 50 syscalls per second:

for (;;)
        poll(0, 0, 11);
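
For anyone who likes to see the arithmetic written down, the same round-up rule can be expressed as a trivial model. This is only my simplified model (the half millisecond of “overhead” is an arbitrary stand-in for call and implementation costs), not the actual callout code:

/*
 * Toy model of "round up to the next tick": a call made delta ms
 * after a tick, asking for a timeout of m ms, wakes on the first
 * tick at or after delta + m.  Not the real kernel arithmetic.
 */
#include <math.h>
#include <stdio.h>

#define TICK_MS 10.0            /* 100Hz clock */

static int
ticks_needed(double delta_ms, double timeout_ms)
{
        return ((int)ceil((delta_ms + timeout_ms) / TICK_MS));
}

int
main(void)
{
        double delta = 0.5;     /* assumed call overhead after a tick */

        /* 9ms needs one tick per call: up to 100 calls per second */
        (void) printf("9ms  -> %d tick(s)\n", ticks_needed(delta, 9.0));
        /* 11ms needs two ticks per call: no more than 50 per second */
        (void) printf("11ms -> %d tick(s)\n", ticks_needed(delta, 11.0));
        return (0);
}

With a 100Hz clock that gives one tick per 9ms call and two ticks per 11ms call, which is exactly where the 100 and 50 calls-per-second ceilings above come from.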

When we come to a 10ms timeout it is fairly safe to predict (based on the above) that the following code should generate somewhere between 50 and 100 syscalls per second, tops:

for (;;)
        poll(0, 0, 10);

Of course we could only hit 100 syscalls per second if the C language overhead and the poll(2) syscall implementation overhead were both zero. On reflection, this obviously cannot be the case!

A second guess would be that all 10ms sleeps must take more than 10ms elapsed time, so we might expect just 50 syscalls per second maximum. However, in the single-threaded case I observe about 60 on my system (and this accords with the 6,000 I saw from 100 threads).

At this point it is important to add that I have verified that none of the poll(2) syscalls are returning early – i.e. all are taking at least 10ms.
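
If you would like to check both numbers for yourself without reaching for DTrace, a few lines of userland C are enough. Again, this is a sketch of the idea rather than the exact code I ran; gethrtime(3C) is the only Solaris-specific piece:

/*
 * Count back-to-back 10ms polls over a few seconds, and track the
 * shortest individual call, to check both the observed rate and the
 * "nothing returns early" claim.
 */
#include <poll.h>
#include <stdio.h>
#include <sys/time.h>           /* gethrtime(3C) */

int
main(void)
{
        const hrtime_t run_ns = 5LL * 1000000000LL;     /* 5 seconds */
        hrtime_t start, t0, t1, shortest = 0;
        long calls = 0;

        start = gethrtime();
        do {
                t0 = gethrtime();
                (void) poll(0, 0, 10);
                t1 = gethrtime();
                if (shortest == 0 || t1 - t0 < shortest)
                        shortest = t1 - t0;
                calls++;
        } while (t1 - start < run_ns);

        (void) printf("%.1f calls/sec, shortest poll %.2fms\n",
            calls / ((t1 - start) / 1e9), shortest / 1e6);
        return (0);
}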

Right now my best guess is that the gap between two consecutive clock ticks reaching the timer callout queue sometimes exceeds 10ms. Of course if this happens it is very likely that the very next tick will reach the callout queue in less than 10ms, but this too would be consistent with my observations.

To summarise: I currently believe that something like this is happening:

poll()  tick0   tick1   tick2   tick3   tick4   tick5   tick6   tick7
9ms             wakeup  wakeup  wakeup  wakeup  wakeup  wakeup  wakeup
10ms                    wakeup          wakeup  wakeup          wakeup
11ms                    wakeup          wakeup          wakeup

In the 10ms example above I’m guessing that the gap between tick4 and tick5 arriving at the timer callout queue is slightly more than 10ms. This results in two consecutive 10ms wakeups. Since I’m seeing 60 syscalls where I would expect 50, I can further surmise that this kind of thing happens about 20% of the time: 60 wakeups still have to fit into 100 ticks per second, which only works if about 20 of them take a single tick and the other 40 take two, i.e. roughly 20 oversized gaps in every 100 ticks.

Whilst I did use some trivial DTrace scripts to verify some of the above data (e.g. the number of actual poll(2) syscalls per second) most of this post is intended to set the scene and whet your appetite for Part 2.

Next time I hope to explain how I used DTrace to find the reason for the excessive context switches I observed. This led to me filing a performance bug and opened up a host of additional questions to do with timers, timeouts and ticks, and how they relate to the time of day.


Initial Thread

Hopefully this won’t become YASCRIBE (yet another self-conscious, readerless initial blog entry), but so far it’s not looking that promising, is it?

After 16+ years at Sun I’m still having plenty of fun, so maybe Multiple Threads will allow me to spread some of the fun around? Of course, your idea of fun may differ significantly from mine. But here goes anyway …

Multithreading has been my thing for the past decade, and so much has been happening in this space! Here are some of the highlights:

  • Solaris 2.0 (1992) multithreaded kernel (for scalable SMP)
  • Solaris 2.2 (1993) libthread (for multithreaded applications, and based on an MxN model)
  • Solaris 2.5 (1995) libpthread (for applications written to the Pthread APIs)
  • Solaris 2.6 (1997) preemption control, scheduler activations, adaptive mutexes (for performance)
  • Solaris 7 (1998) libmtmalloc (for improved heap performance), 64-bit address space (for lots more threads!)
  • Solaris 8 (2000) alternate libthread implementation based on a 1-1 model (more scalable, improved semantics, simpler to maintain and extend)
  • Solaris 9 (2002) removed MxN implementation altogether, added Thread Local Storage support (allows C compilers to implement __thread directive efficiently)
  • Solaris 10 (2005) rolled libthread into libc (i.e. threads are now first class citizens), added various POSIX APIs (e.g. posix_spawn() and spin locks)
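
For anyone who has not yet met the POSIX spin lock API, it looks roughly like this. This is a minimal sketch of the API shape only, and certainly not a recommendation to sprinkle spin locks about: they only pay off for very short critical sections on multiprocessor systems.

/*
 * Two threads bumping a shared counter under a POSIX spin lock,
 * just to show the shape of the API.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lock;
static long counter;

static void *
bump(void *arg)
{
        int i;

        for (i = 0; i < 1000000; i++) {
                (void) pthread_spin_lock(&lock);
                counter++;
                (void) pthread_spin_unlock(&lock);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t t1, t2;

        (void) pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        (void) pthread_create(&t1, NULL, bump, NULL);
        (void) pthread_create(&t2, NULL, bump, NULL);
        (void) pthread_join(t1, NULL);
        (void) pthread_join(t2, NULL);
        (void) pthread_spin_destroy(&lock);
        (void) printf("counter = %ld\n", counter);      /* expect 2000000 */
        return (0);
}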

My technical whitepaper Multithreading in the Solaris Operating Environment reviews most of the above in much greater detail — up to and including Solaris 9 (being written for that release). I hope to use this blog to fill in many of the gaps concerning the exciting stuff we have done since then, not least what we did to make threads first class citizens in Solaris 10.

I hope to be back before too long to share more of this really interesting stuff — probably starting with how I made getenv(3C) lockless in Solaris 10. So watch this space!
