Caring For The Environment – Making getenv() Scale

Although a relatively minor contributor to OpenSolaris I still have the satisfaction of knowing that every Solaris 10 process is using my code. But who in their right mind needs getenv(3C) to scale? Of course if you don’t care about thread safety (as is currently the case with glibc version 2.3.5 — and hence with Linux) your implementation might scale very nicely thankyou!

Sun on the other hand does care about thread safety (and we’ve been doing so for a long time). However we had rather assumed that no one in their right mind would need getenv() to scale, so our implemetation was made thread safe by putting a dirty great mutex lock around every piece of code which manipulates environ. After all, as our very own Roger Faulkner is so fond of saying: “Correctness is a contraint; performance is a goal”. And who cares about getenv() performance anyway?

But Who Really Cares?

Well it turns out that there are some significant applications which depend on getenv() scalability (and which scale wonderfully on Linux … where thread safety is often ignored … they are just very lucky that no one seems to be updating environ whilst anyone else is reading it). So Bart Smaalders filed bug 4991763 getenv doesn’t scale and said he thought it was an excellent opportunity for my first putback. Thanks Bart!

For some time I’ve been saying: “If Linux is faster, it’s a Solaris bug!” but somehow 4991763 didn’t quite fit the bill. Firstly, I think an application which depends on getenv() scalability is broken. Secondly, Linux is just feeling lucky, punk. However, I do firmly believe that we should do all we can to ensure that Solaris runs applications well — even those which really need some tuning of their own. I had also been itching for an chance to explore opportunities for lockless optimisations in libc, so all in all 4991763 was an opportunity not to be missed!

A Complete Rewrite

The existing implementation of getenv(), putenv(), setenv(), and unsetenv() was split across three files (getenv.c, putenv.c and nlspath_checks.c) with the global mutex libc_environ_lock being defined elsewhere. Things had become fairly messy and inefficient so I decided on a complete rewrite.

NLSPATH security checks had introduced yet another global variable and a rather inefficient dance involving a mutex on every {get,put,set,unset}env() call just to ensure that clean_env() was called the first time in. In this instance it was an easy matter to remove the mutex from the fast path by doing a lazy check thus:

static mutex_t                  update_lock = DEFAULTMUTEX;
static int                      initenv_done = 0;
char *getenv(const char *name)
{
    char                    *value;
    if (!initenv_done)
        if (findenv(environ, name, 1, &value) != NULL)
            return (value);
    return (NULL);
}

The test was then repeated under the protection of the mutex in initenv() thus:

extern void                     clean_env();
static void
initenv()
{
    lmutex_lock(&update_lock);
    if (!initenv_done || ... ) {
        /* Call the NLSPATH janitor in. */
        clean_env();
        .
        .
        .
        initenv_done = 1;
    }
    lmutex_unlock(&update_lock);
}

By rolling putenv.c into getenv.c I was able to eliminate the use of globals altogether, which in turn allowed the compiler to produce better optimised code.

Look, No Locks!

But the biggest part of the rewrite was to make the fast path of getenv() entirely lockless. What is not apparent above is that findenv() is entirely lockless.

Various standards define the global environment list pointer:

extern char **environ;

This has to be kept as a consistent NULL terminated array of pointers to NULL terminated strings. However, the standards say nothing about how updates are to be synchronised. More recent standards forbid direct updates to environ itself if getenv() and friends are being used.

Yet the requirement that environ is kept consistent is precisely what we need to implement a lockless findenv(). The big issue is that whenever the environ list is updated, anyone else in the process of scanning it must not see an old value which has been removed, or miss a value which has not been removed.

The traditional approach is to allocate an array of pointers, with environ pointing to the first element. When someone needs to a new value to the environment list we simply add it to the end of the list. But how do we go about deleting values? And what if we need to add a new value when the allocated array is already full? If you care about threads, it’s not long before you need to introduce some locking!

The new implementation contains two “smarts” which meets these challenges without introducing locks into the getenv() path …

Double It And Drop The Old One On The Floor

When the new implementation needs a bigger environ list, it simply allocates a new one which is twice as large and copies the old list into it. The old list is never reused — it is left intact for the benefit of anyone else who might happen to be still traversing it.

This may sound wasteful, but the environment list rarely needs to be resized. The wastage is also bounded — it is quite easy to prove mathematically that this strategy never consumes more than 3x the space required by an array of optimal size.

However, one teeny weeny issue with the “drop it on the floor” approach is that leak detectors can get a tad upset if they find allocated memory which is not referenced by anyone. With a view to keeping me on the straight and narrow — but mostly to avert high customer support call volumes — Chris Gerhard recommended that I keep a linked list of all internally dropped memory (just to keep those goody-goody leak detectors happy).

I first met Chris in 1989. He was on my interview panel when I joined Sun. I do hope he feels he did a good job that day!

Overwrite With The First Entry And Increment

I was bouncing some other getenv() ideas around with Chris when he also gave me just what I needed for deletes. The old code just grabbed the global lock, found the element to be deleted, shifted the rest of the list down one slot (overwriting the victim), and then released the lock.

Chris had the bright idea of copying the first element in the list over the victim, and then incrementing the environ pointer itself. The worst case would be that the same element might be seen twice by another thread, but this is not a correctness issue.

This led to two further changes:

  1. New values are now added at the bottom of the environment list
    (with the environ pointer being decremented once the new value is in place).
  2. When a new doube-sized environment list needs to allocated, the
    old one is copied into the top of the new one (instead of the bottom) so that the list can then be grown downwards.

OK, Not Entirely Lockless

Obviously mutex lock protection is still needed to serialise all updates to the environment list. The new implementation has a lock for all seasons: update_lock (e.g. for updating initenv_done and for protecting environ itself). However the new getenv() is entirely lockless (i.e. once clean_env() has been called once).

Another important issue is that it is considered bad practice forsystem libraries to hold a lock while calling malloc(). For this reason the first two thirds of addtoenv() are inside a for(;;) loop. If it is necessary to allocate a larger environment array addtoenv() needs to drop update_lock temporarily. However this opens a window for another thread to modify environ in such a way that means we must retry. This loop is controlled by a simple generation number environ_gen (also protected by update_lock).

Guaranteeing Consistency

That’s almost all there is to it. However in multiprocessor systems we still have to make sure that memory writes on one CPU happen in such a way that they don’t appear out of sequence on another CPU. Of course this is taken care of automatically when we use mutex locks.

Consider the following code fragment to insert a new item:

environ[-1] = string;
environ--;

It is vitally important the two memory writes implied by this are seen in the same order to every CPU in the system. On SPARC today this doesn’t matter since all ABI compliant binaries run in Total Store Order mode (i.e. stores are guarantted to be visible in the order in which they are submitted). But is it possible that future systems will used a more relaxed memory model.

However, this is not just a hardware issue, it is also a compiler issue. Without extra care the compiler might reorder the two stores, since the C language cares nothing for threads. I had quite a long discussion with a number of folk concerning “sequence points” and the use of volatile in C.

The eventual solution was this:

environ[-1] = string;
membar_producer();
environ--;

First, the function membar_producer() serves as a sequence point, guaranteeing that the C compiler will preserve the order of the preceding and following statements. Secondly, it provides any store barriers needed by the underlying guarantee the same effect as Total Store Order for the preceding and following instructions.

A Spirit Of Generosity

My new implementation was integrated into s10_67 yet despite my own extensive testing it caused a number of test suites to fail in later testing. This was tracked down to common test suite library code which updated environ directly. Yuck! Although this kind of thing is very much frowned on by the more recent standards it was felt that if our own test cases did it there was a good chance that some real applications did it too. So with some reluctance I filed 6183277 getenv should be more gracious to broken apps.

If someone else if going to mess with environ under your nose there’s not a lot you can do about it. However it is fairly easy to detect the majority of cases (except for a few really nasty data races) by keeping a private copy my_environ which can be compared
with the actual value of environ. If these two values are ever found to differ we just go back to square one and try again.

So the above fragment for adding an item now looks more like this:

if (my_environ != environ || ...)
    initenv();
    .
    .
    .
    my_environ[-1] = string;
    membar_producer();
    my_environ--;
    environ = my_environ;

Conclusion

My second putback integrated into s10_71. Following this I had to fight off one other challenge from Rod Evans who filed 6178667 ldd list unexpected (file not found) in x86 environment. However this turned out to be not my fault, but a latent buffer overrun in ldd(1) exposed by different heap usage patterns. Of course, the engineer who introduced this heinous bug (back in January 2001) will remain anonymous (sort of). Still, he did buy me a beer!

Of course, the serious point is that when you change something which is used by everyone, it is possible that you expose problems elsewhere. It is to be expected that the thing last changed will get the blame. Such changes are not for the faint-hearted, but they can be a whole lot of fun!

My first experience of modifying Solaris actually resulted in two putbacks into the Solaris gate, but I learnt a great deal along the way. Dizzy with my success, I am now actively seeking other opportunities for lockless optimisations in libc!

Technorati Tags: ,

If Linux is faster, it’s a Solaris bug!

Well that’s what I said back in 2002 — and it certainly had an impact within Sun — but I see Darrin has toned it down a little in his inspirational list of performance mantras. Somehow “if a competing OS is faster, it’s a Solaris bug and it’s important” doesn’t have quite the same punch!

I still stand behind my original call to action. And I’m glad to report that the gap has been closing and continues to close. Of course there will always be cases where the RAS (reliability, availability, and serviceability) or scalability of one platform make it slower than a competitive but less capable platform. However, the days of attributing all performance vices to other mitigating virtues are long gone. Quite often code is slow simply because it is slow and needs tuning.

I’ve already promised to blog about my lockless scalability fix for getenv(3C) (and I will do so soon). This is just one instance of me putting my putback where my slogan is. Although in this case it would have been possible to take the moral high ground since the other platform’s implementation was only faster because it was completely multithread unsafe — it may still be so for all I know, but I do know that Solaris 10 now has a fast and safe implementation.

Is it too cheeky to suggest a corollary? …

“If Solaris has a cool distinguishing feature, it’s a Linux bug!”

Tag: ,

 

Time, Timers, Timeouts and Ticks – Part 1

One annoying thing about DTrace is that there is now generally no good excuse for not finding out what really is going on whenever a puzzling system statistic rears its ugly head. In the past I would just shrug off such gaps in my understanding and move onto something “more imporant” (i.e. less soul destroying). With the old tools the “return on investment” was often just too low, but with DTrace the ROI really can be significant: not only is my understanding improving, but I’m finding new stuff to “go fix” all the time!

Jon Haslam and I were sipping coffee in a Washington DC hotel lobby, hacking together some examples for our demo session at the SUPerG 2005 conference. I was simulating a thundering herd scenario on my Acer Ferarri laptop by having 100 threads in a process doing this:

for(;;)
poll(0, 0, 10);

I’d done this kind of thing many times before, but this time I happened to notice that I wasn’t getting my expected number of syscalls or context switches, as reported by good old vmstat(1M). By some strange quirk I was seeing only about 6,000 additional syscalls per second, but context switches were up by 10,000 per second (where I naively expected the syscalls to be). Of course, I now realise that I should have expected no more than 5,000 poll(2) syscalls per second.

I will presume the reader is pretty used to the idea that Solaris rounds most timers and timeouts (I’m not going to cover CLOCK_HIGHRES timers here) up to the next clock tick. The default clock frequency is 100Hz (and has been as long as I can remember, although from Solaris 7 onwards this has been configurable) which equates to a tick interval of 10ms.

This means that the following code segment running in a single thread will generate no more that 100 poll(2) syscalls per second:

for (;;)
poll(0, 0, 9);

Although we’re asking for 9ms timeouts these will get rounded up to the next 10ms tick. On a quiet system we can expect the first call to synchronise us with the clock after one or two ticks. From then on all subsequent calls should only require one tick each (with nearly 1ms to spare for call and implementation overhead).

Likewise we can fairly safely predict that following code segment will generate no more than 50 syscalls per second:

for (;;)
poll(0, 0, 11);

When we come to a 10ms timeout it is fairly safe to predict (based on the above) that the following code should generate somewhere between 50 and 100 syscalls per second tops:

for (;;)
poll(0, 0, 10);

Of course we can only hit 100 syscalls per second if the C langauge overhead and the poll(2) syscall implementation overhead were both zero. On reflection, this obviously cannot be the case!

A second guess would be that all 10ms sleeps must take more than 10ms elapsed time, so we might expect just 50 syscalls per second maximum. However in the singlethreaded case I observe about 60 on my system (and this accords with the 6,000 I saw from 100 threads).

At this point it is important to add that I have verified that none of the poll(2) syscalls are returning early – i.e. all are taking at least 10ms.

Right now my best guess is that the gap between two consecutive clock ticks reaching the timer callout queue sometimes exceeds 10ms. Of course if this happens it is very likely that the very next tick will reach the callout queue in less than 10ms, but this too would be consistent with my observations.

To summarise. I currently believe that something like this is happening:

poll() tick0 tick1 tick2 tick3 tick4 tick5 tick6 tick7
9ms wakeup wakeup wakeup wakeup wakeup wakeup wakeup
10ms wakeup wakeup wakeup wakeup
11ms wakeup wakeup wakeup

In the 10ms example above I’m guessing that the gap between tick4 and tick5 arriving at the timer callout queue is slightly more than 10ms. This results in two consecutive 10ms wakeups. Since I’m seeing 60 syscalls where I would expect 50, I can further surmise that this kind of thing happens about 20% of the time.

Whilst I did use some trivial DTrace scripts to verify some of the above data (e.g. the number of actual poll(2) syscalls per second) most of this post is intended to set the scene and whet your appetite for Part 2.

Next time I hope to explain how I used DTrace to find the reason for the excessive context switches I observed. This led to me filing a performance bug and opened up a host of additional questions to do timers, timeouts and ticks and how they relate to the time of day.

Tag: ,