This global economy thing is scary!

My current electronics hobby project is to build some ADAT “Lightpipe” repeaters and patch panels for my home studio (yes, there are commercial products out there, but they are very pricey for what they do).

ADAT uses the same plastic fibre-optic “TOSLINK” technology found on many domestic CD/DVD players, games consoles, AV processors etc, but carries up to eight 48 kHz (or four 96 kHz) 24-bit PCM signals (i.e. a higher bandwidth requirement than traditional S/PDIF or Dolby 5.1 applications).

Generally, I use RS-Components for such projects, but I’ve had real problems sourcing the TOSLINK transmitter and receiver parts in the UK. They are only made by Toshiba and Sharp, and despite being ubiquitous in domestic audio appliances they are very hard or expensive for hobbyists to find at a reasonable price.

Googling around revealed one enthusiast who used to sell them for £6 a piece, and another UK online supplier currently quoting over £17 per device. At those prices, one might as well buy the commercial product!

RS will source non-catalogue items for account holders, but they require a £200 minimum order (plus they quote 10+ days for delivery). The big distributor EBV don’t sell to the little guy at all (they require a trade account, and a minimum order of £1000).

Enter Digi-Key. They’re based in the USA, but have localised online ordering for the UK and many other countries. About 36 hours ago I placed an order for 20x TOTX147L and 20x TORX147L parts (3v, shuttered, 15Mbps transmitter and receiver). The price was about £1 each, plus £10 handling because my order was under £75, plus £12 shipping because my order was under £100.

The UPS man knocked on the door a couple of hours ago, taking COD for the £12 VAT duty due. Amazing! 40 parts for just £74. But what really blew my socks off was that it arrived about half an hour before the RS order I placed the same evening (for all the other bits I need for my project).

Solaris Threads Tunables – Part 1


It is my intention to make this a series of short postings, each looking at one or two specific tunables. Hopefully, this “Part 1” will be followed by a “Part 2” and a “Part 3” and so on — until I run out of interesting things to say on the subject. But I’m not the most prolific blogger, so don’t hold your breath!

Read The Paper

I suffer from blogger’s block. Part of my problem is gauging the amount of context which needs filling out before I can get to the meat of what I want to say. For this short series of posts I am not going to go into lots of detail about the Solaris multithreading architecture. In Solaris 8 we began a U-turn on our famous MxN application threads architecture when we introduced the alternate 1:1 implementation. In Solaris 9 we dropped the MxN implementation altogether, and the thankless task of explaining why this was a good thing fell to me. I’m going to assume that you’ve read Multithreading in the Solaris Operating Environment, most of which still applies in Solaris 10 and <a href="http://opensolaris.org">OpenSolaris</a> (we’ve just made it even better).

Heed The Warnings


“It betrayed Isildur to his death. And some things that should not have been forgotten were lost. History became legend, legend became myth, and for two and a half thousand years the Ring passed out of all knowledge. Until, when chance came, it ensnared a new bearer. The Ring came to the creature Gollum, who took it deep into the tunnels of the Misty Mountains. And there, it consumed him. The Ring brought to Gollum unnatural long life. For five hundred years it poisoned his mind. And in the gloom of Gollum’s cave, it waited. Darkness crept back into the forests of the world. Rumor grew of a shadow in the East, whispers of a nameless fear, and the Ring of Power perceived its time had now come.” — J R R Tolkien, The Lord Of The Rings

The stuff I’m going to mention used to be hidden and was intended solely for our own use, not yours. We believe Solaris should perform well out of the box, and that the provision of tunables amounts to an admission of failure. We have never formally published or documented these things before because we don’t think you really need them; we don’t want them to become de facto standards; and we need the freedom to remove or change them in future. In many cases tweaking these tunables will actually hurt application performance. They are also a very blunt instrument, affecting all the threads and synchronisation objects within a process. So although some regions of your code may speed up, other regions may suffer, with no net gain overall.

Please use this information responsibly. It is unlikely that ISVs have tested their applications with anything other than the default values (I know we don’t). Be up-front with your suppliers (including us) when dealing with support issues, and make sure you eliminate these tunables early in their inquiries.

So Why Bother?

With the advent of OpenSolaris some things which should have been forgotten are going to be found. Indeed, you will find the names of all the tunables I’m going to mention here. We can’t hide this stuff any more — nor do we want to. With OpenSolaris we say “what’s ours is yours”.

Of course, there may be cases where fiddling with these tunables actually helps application performance. Sometimes this will be because the applications themselves are poorly written, but it may also indicate that we have further work to do in improving Solaris performance out of the box. Please do share your data! And sharing this information with your suppliers — especially some of the debugging features — may actually speed the resolution of some support issues.

The Secret’s Out

For now, I’m just going to tell you what the tunables are. You’ll have to wait for subsequent posts (or go grok the source) for an understanding of what they actually do and how they may be useful to you.

In /usr/src/lib/libc/port/threads/thr.c we find these names:

QUEUE_SPIN
ADAPTIVE_SPIN
RELEASE_SPIN
MAX_SPINNERS
QUEUE_FIFO
QUEUE_VERIFY
QUEUE_DUMP
STACK_CACHE
COND_WAIT_DEFER
ERROR_DETECTION
ASYNC_SAFE
DOOR_NORESERVE

The function set_thread_vars() scans the environment variable list looking for these names prefixed by “_THREAD_” or “LIBTHREAD_”. The latter is deprecated now that libthread has been folded into libc, but provides some compatibility back as far as the alternate thread library implementation in Solaris 8 (although there have been some changes along the way which I am not going to document in these postings).
The function etest() specifies a maximum value for each variable and ensures that any user-defined value falls between zero and this limit. The default values are defined elsewhere in the code. In OpenSolaris the tunables are defined as follows:

envvar                          limit           default
_THREAD_QUEUE_SPIN              1000000         1000
_THREAD_ADAPTIVE_SPIN           1000000         1000
_THREAD_RELEASE_SPIN*           1000000         500
_THREAD_MAX_SPINNERS            100             100
_THREAD_QUEUE_FIFO              8               4
_THREAD_QUEUE_VERIFY**          1               0
_THREAD_QUEUE_DUMP              1               0
_THREAD_STACK_CACHE             10000           10
_THREAD_COND_WAIT_DEFER         1               0
_THREAD_ERROR_DETECTION         1               0
_THREAD_ASYNC_SAFE              1               0
_THREAD_DOOR_NORESERVE          1               0

All but the last one also exist in Solaris 10 (and for now, the LIBTHREAD_ variants will also work). Making a change is as simple as setting an environment variable (just remember that environment variables are generally inherited by child processes).

I hope this has whetted your appetite for the next post, but in the meantime, “hey, be careful out there”!


Three Abstracts Submitted for CEC 2006

The following abstracts were submitted for Sun’s internal Customer Engineering Conference 2006. Of course there is no guarantee that this material will be accepted by the CEC panel, but I’d be happy to present the same (or similar) material at other events. If you’re interested, please drop me a line.

Microbenchmarking – Friend or Foe?

Many purchasing, configuration and development choices are made on the basis of benchmark data. Industry organisations such as SPEC and TPC exist to inject a measure of realism and fairness into the exercise. However, such benchmarks are not for the faint-hearted (e.g. they require considerable hardware, software and people resources). Additionally, the customer may feel that an industry-standard benchmark is not sufficiently close to their own perceived requirements. Yet building a bespoke benchmark for a real-world application workload is an order of magnitude harder than going with something “off the peg”. It is at this point that an alarming number of customers make the irrational leap to some form of microbenchmarking — whether it is good old “dd” to test an I/O subsystem, or perhaps LMbench’s notion of “context switch latency”. The whole is rarely greater than the sum of its parts, but the issue often ignored is that a microbenchmark — by very definition — only considers one tiny component at a time, and then only covers a small subset of functionality in total. Furthermore, it is often observed that some microbenchmarks are very poor predictors of actual system performance under real-world workloads.

Is there any place for microbenchmarking? Certainly, we need to be aware that customers may be conducting ill-advised tests behind closed doors. But should we ever dare engage in such dubious activities ourselves? In short: yes! In the right hands microbenchmarks can highlight components likely to respond well to tuning, and assist in the tuning process itself. This session will focus on libMicro: an in-house, extensible, portable suite of microbenchmarks first used to drive performance improvements in Solaris 10. The libMicro project was driven by the conviction that “If Linux is faster, it’s a Solaris bug”. However, some of the initial data made the case so strongly that we chose to adopt the Monsters Inc. slogan “We scare because we care” at first! libMicro is now available to you and your customers under the CDDL via the OpenSolaris programme. Key components of libMicro will be demonstrated during this session. The demo will include data collection, reporting and adding of new cases to the suite.

Note: I took them seriously about the 2500-character, two-paragraph limit.

Synchronicity: Solaris Threads and CoolThreads

The Unified Process Model is one of the best kept secrets in Solaris 10. Yet this “so what?” feature entailed changes to over 1600 source files. But was it all a waste of effort? For over a decade Sun has been recognised as a thought leader in software multithreading, but did we lose the plot when we dropped the idealistic two-level MxN implementation for something much simpler in Solaris 9? To both of these questions we must answer a resounding “No!”. Indeed, the Unified Process Model, under which every process is now potentially a multithreaded process, was only made possible by a simpler, more scalable, more reliable, more maintainable, realistic one-level 1:1 implementation. And all this goodness just happens to coincide with the CoolThreads revolution. As other vendors chime in with CMT, Solaris is streets ahead of Linux and other platforms in being able to deliver real benefits from this technology. It is extremely important that we are able to understand, articulate and exploit this synchronicity.

Note: this time I realised that they didn’t really mean 2500 chars!

DTrace for Dummies

Wonder what all the fuss is about? Need a good reason before you engage your brain with this stuff? Think this may be one new trick too far for an aging dog? Just curious? Then this session is for you! We have a reputation for making DTrace come alive for even the most skeptical and indifferent of crowds — D is certainly not for “dull” at our shows! Don’t worry, we won’t get you bogged down in syntax or architecture. But we will convince you of the dynamite that is the DTrace observability revolution — that, or you are dumber than we thought! Everything you see will happen live. We don’t use any canned scripts. Anything could happen. You’d be a fool to miss it!

Notes: This was a joint submission from me and Jon Haslam. We’ve found our combination of sound technical content and Brit humour very effective at getting across the DTrace value proposition to a wide audience. We first did our double act (Jon types while Phil talks) at SUPerG 2004. Following rave reviews we were asked to present a plenary session at SUPerG 2005.


Scorching Mutexes with CoolThreads – Part 1

How could I not be excited about CoolThreads?! Regular readers of Multiple Threads will be aware of my technical white paper on Multithreading in the Solaris Operating Environment, and of more recent work making getenv() scale for Solaris 10. And there’s a lot more stuff I’m itching to blog, not least libMicro, a scalable framework for microbenchmarking which is now available to the world via the OpenSolaris site.

For my small part in today’s launch of the first CoolThreads servers, I thought it would be fun to use libMicro to explore some aspects of application-level mutex performance on the UltraSPARC T1 chip. In traditional symmetric multiprocessor configurations, mutex performance is dogged by inter-chip cache-to-cache latencies.

Little Monster

To applications, the Sun Fire T2000 Server looks like a 32-way monster. Indeed, log in to one of these babies over the network and you soon get the impression of a towering, floor-breaking, hot-air-blasting, mega-power-consuming beast. In reality, it’s a cool, quiet, unassuming 2U rackable box with a tiny appetite!

Eight processor cores — each with its own L1 cache and four hardware strands — share a common on-chip L2 cache. The thirty-two virtual processors see very low latencies from strand to strand, and core to core. But how does this translate to mutex performance? And is there any measurable difference between inter-core and intra-core synchronization?

For comparison, I managed to scrounge a Sun Fire V890 Server with eight UltraSPARC IV processors (i.e. 16 virtual processors in all). Both machines are clocked at 1.2GHz.

Spin city

First up, I took libMicro’s cascade_mutex test case for a spin. Literally! This test takes a defined number of threads and/or processes and arranges them in a ring. Each thread has two mutexes on which it blocks alternately; and each thread manipulates the two mutexes of the next thread in the ring such that only one thread is unblocked at a time. Just now, I’m only interested in the minimum time taken to get right around the loop.

The default application mutex implementation in Solaris uses an adaptive algorithm in which a thread waiting for a mutex does a short spin for the lock in the hope of avoiding a costly sleep in the kernel. However, in the case of an intraprocess mutex the waiter will only attempt the spin as long as the mutex holder is running (there is no point spinning for a mutex held by a thread which is making no forward progress).

Wow!

With 16 threads running cascade_mutex the T2000 achieved a blistering 11.9us/loop (that’s less than 750ns per thread)! The V890, on the other hand, took a more leisurely 25.3us/loop. Clearly, mutex synchronization can be very fast with CoolThreads!

Naturally, spinning is not going to help the cascade_mutex case if you have more runnable threads than available virtual processors. With 32 active threads the V890 loop time rockets to 850us/loop, whereas the T2000 (with just enough hardware strands available) manages a very respectable 32.4us/loop. Only when the T2000 runs out of virtual processors does the V890 catch up (due to better single thread performance). At 33 threads the T2000 jumps to 1140us/loop versus 900us/loop on the V890.

In conclusion

libMicro’s cascade_mutex case clearly shows that UltraSPARC T1 delivers incredibly low latency synchronization across 32 virtual processors. Whilst this is a good thing in general it is particularly good news for the many applications which use thread worker pools to express their concurrency.

In Part 2 we will explore the small difference cascade_mutex sees between strands on the same core and strands on different cores. Stay tuned!


Roy Chadowitz, 1955-2005

I started at Sun at the end of February, 1989. I think Roy joined within a month of me. When I made the jump from Customer Services to Sales Support, he was my first manager. I owe him a lot in terms of my personal development. He was always something of an older brother figure and mentor to me. Roy always showed a genuine interest in individuals and their families. He did all he could to make working for Sun more family-friendly. He set a good example of working hard whilst having fun, and without forgetting that there was much more to life than work.

The only criticism I ever heard of Roy was that he was very tight with money. Sometimes he seemed to give the impression that our business expenses, equipment budget and so on came directly out of his personal savings. But he always applied the same control to his own spending — only more so. Even when he was housebound it was hard to get Roy to spend even modest amounts on the equipment he really needed to do his job. I guess the chipboard coffin is standard Jewish practice, but I’m sure Roy would have heartily approved!

But what sticks with me most is Roy’s patience and endurance through so much suffering. He had so many secondary issues to deal with due to his early cancer treatment (I guess they went over the top on that because they didn’t think he had long to live).

I think it is worth telling the story of Roy being given 6 months to live — 10 years ago. I think we need to record his 50th birthday. The story of Roy tripping on a marquee guy rope and then walking around with a broken neck for at least 6 months is remarkable. If I recall correctly, he was in a lot of pain, but they put it down to his cancer. Roy proudly showed me his x-rays (first the beautiful intricate work of the neurosurgeons supporting his broken neck, and then the functional rods inserted by the orthopaedic chaps). Later he showed me the x-ray of the electrodes wired into his spine for pain relief (something like a cross between a TENS machine and a pacemaker).

He endured so many operations, so much pain, and so many setbacks. But he also made huge progress and frequently amazed us with his determination to bounce back and continue with his work. I will always remember seeing Roy in “The Directors” line-up at the 2004 kick-off meeting in Nottingham.

The last year was hard to watch. It doesn’t seem long after Nottingham that the sudden paralysis came. It was hard to see Roy brought so low. But he wouldn’t give up fighting, even then. Sometimes we felt he needed to stop work, but it had to be his choice. And as we watched, it seemed clear that Roy knew he would decline very rapidly if he had nothing to do, nothing to aim for any more. Very rarely did we hear any complaint.

It was my privilege to be welcomed in to the Chadowitz home on many occasions. Carol always kept me well fed, and Simon was often good company during the long waiting periods which come with certain system admin chores. I remember the excitement around Lee’s Bar Mitzvah, and Roy’s pride in explaining it all to me. What an amazing family! So much practical love. There were dark moments — times when it was clear that Roy was troubled about the burden he had become, but no-one would hear such talk! The love that Roy and Carol shared was an inspiration to see.

I last emailed Roy the day after his 50th birthday (although I didn’t know that at the time), two days before he finally left us. That was the first time I didn’t get a reply. When Greg phoned on the Saturday evening it was the call I had been dreading for years — 10 years. I’d expected it, but somehow Roy’s determination to keep going had lulled me into a false sense of security. It was like hitting a wall.

I had got used to seeing Roy confined to his bed. Even then I’d seen him struggle — and improve! But I’d been away for too many weeks (business trips and heavy colds kept me away). I didn’t see Roy at his lowest. I didn’t see him when he was unable to speak. That would have been intolerable for him. He always had so much to say. So much to give.

His most common parting shot to me was “Keep smiling, Phil”. It’s hard to do that just now, Roy, but I’ll try!


Note: I have closed comments on this entry due to comment SPAM.
Please email me directly if you want to add something further.

blogs.sun.com on BBC Radio 4

I just happened to listen to this: “BBC R4 16:00 Shop Talk – From online consumer blogs to mobile phone pictures being broadcast as TV news; the public are using technology to tell their side of the story. Heather Payton and her guests discuss the impact ‘citizen’s media’ is having on big business.”

Great programme about blogging in general, and the commercial opportunities for blogging in particular. Lots of excellent input from our very own Simon Phipps (great job, Simon)! And we can all give ourselves a hearty pat on the back for being part of such an exciting story 🙂

Look out for the “Listen Again” link via this page: http://www.bbc.co.uk/radio4/news/shoptalk/ (today’s programme should appear later today, and hang around for about a week).

Keep blogging!

How not to load-balance connections with accept()

It is fairly well known that multiple threads in multiple processes can wait for TCP/IP connections on the same listening socket. What is not so well known are the nuances of implementation which affect the load balancing of such connections across processes and threads. Here I’m going to share what I learned whilst working on improving the scalability of a popular 3rd party software product (which shall remain nameless).

Let’s not worry too much about how multiple processes come to be sharing the same listening socket (this could easily be achieved by inheritance across fork(2) or using ioctl(2) with I_SENDFD (see streamio(7I))). Instead, let’s consider how connections are allocated to threads in such a situation.

The Relatively Easy Way

Threads blocking in accept(3SOCKET) join a socket-specific queue in “prioritised FIFO” order. This is almost exactly what we want for round-robin allocation of connections!

However, it does pose a challenge when a thread “just happens” to have a low priority when it arrives at the queue. If having a low priority is a relatively rare occurrence, this could mean that threads thus afflicted suffer starvation (i.e. if there are always one or more higher priority threads queued whenever a connection arrives).

In the olden days this was a real issue for threads in the TS scheduling class. Of course, one solution would be to use a fixed priority scheduling class (such as FX).
However, what was needed was for threads sleeping in “prioritised FIFO” queues to have their priority and queue position recalculated on a regular basis (in the same manner as runnable threads waiting for a CPU).

The bug 4468181 “low priority TS threads on a sleep queue can be victimised” was first fixed in Solaris 9. The fix has since been made available for Solaris 8, Solaris 7, and Solaris 2.6 via patches, but in these cases it has to be enabled by adding the following to /etc/system:

set TS:ts_sleep_promote=1

The fix is enabled by default in Solaris 9 onwards, and once it is in place connections to a specific listening socket will be allocated to threads on a near round-robin basis.

The Extremely Hard Way

For whatever reason (“it seems to work ok on AIX” is not a good one!) some folk like to put their listening sockets into non-blocking mode, and then to wait for connections in select(3C) or poll(2) — it is important to note that in Solaris, select() is implemented using poll() in any case.

This is all fine and dandy, except that Solaris uses LIFO for the poll sleep queues. Whilst this suits thread worker pools very well (because it helps keep caches warm) it is precisely not what one needs for the round-robin load-balancing of connections between threads.

Of course, the “prioritised FIFO” queueing of accept() and the LIFO queueing of poll() is either not specified or it is classified as “implementation specific” by the standards. Solaris does what it does because it seems to please most of the people most of the time. However, if you assume FIFO you can be in for a nasty surprise!

Actually, having multiple threads waiting in poll() for connections on the same listening socket is a pretty dumb idea. The reason is that a thundering herd will occur each time a connection comes in.

Some implementations actually result in every thread coming right out of poll() and diving into accept() where only one will succeed. Solaris is a little smarter in that once one thread has picked up a connection, other threads still in poll() will go back to sleep.

Desperate Situations, Desperate Measures

If you have an application which is doing things the extremely hard way, then the best solution is a complete rewrite! However, if this is not an option open to you “The Harman House Of Hacks” is proud to present the following:

#define _REENTRANT
#include <sys/types.h>
#include <sys/socket.h>
#include <poll.h>
#include <dlfcn.h>
#include <stdlib.h>

static int (*orig_accept)();

#pragma init(iaccept_init)
#pragma weak accept=_accept

/* Look up the real accept() when the library is loaded. */
static void
iaccept_init()
{
    orig_accept = (int(*)())dlsym(RTLD_NEXT, "_accept");
}

/* Sleep for a random 0-20ms before calling the real accept(). */
static int
_accept(int s, struct sockaddr *addr, socklen_t *addrlen)
{
    int res;

    (void) poll(0, 0, drand48() * 21.0);
    res = orig_accept(s, addr, addrlen);
    return res;
}

This is intended to be compiled and used something like this:

$ cc -K pic -G -o iaccept.so iaccept.c
$ LD_PRELOAD=./iaccept.so brokenapp

This works by interposing a new implementation of accept() which does a short random sleep before diving into the standard implementation of accept(). This means that the thread at the head of the listening socket’s sleep queue is no longer likely to be the first thread into accept().

Conclusion

Of course, this is a rather blunt instrument which should not be applied universally — or lightly. However, although more threads in the thundering herd are likely to come all the way out of poll(), this hack does a very nice job of scrambling the sleep queue order. For the application for which the above hack was written, the latency added to accept() was not an issue. More importantly, connections were very evenly balanced across processes.


data != information != knowledge != wisdom != truth

Whilst reading this fun posting on the subject of Intelligent Design (surely, something we in the OpenSolaris universe know a lot about!) I was taken with the blog’s heading:

  • DATA IS NOT INFORMATION
  • INFORMATION IS NOT KNOWLEDGE
  • KNOWLEDGE IS NOT WISDOM
  • WISDOM IS NOT TRUTH

I like it because it speaks to little things (such as the blogosphere), and big things (such as the answer to the ultimate question of life, the universe, everything).

Another posting explains the caption’s origins. As a blog disclaimer, it is rather more useful than this:

THE INDIVIDUALS WHO POST HERE WORK AT SUN MICROSYSTEMS. THE OPINIONS EXPRESSED HERE ARE THEIR OWN, ARE NOT NECESSARILY REVIEWED IN ADVANCE BY ANYONE BUT THE INDIVIDUAL AUTHORS, AND NEITHER SUN NOR ANY OTHER PARTY NECESSARILY AGREES WITH THEM.

A Blogging Roller-Coaster Ride!

Why do I spend so much time fighting Roller? Why can’t I just get on and do my blogging thang? My patience is wearing thin. Shouldn’t the default environment just work? Why do I have to become an HTML/CSS expert? I didn’t sign up for this!

And judging by the quality of the OpenSolaris blogs, I’m not alone in my frustration. The content is great, but the presentation is often poor.

So, in the absence of better advice I have decided on the following:

  • I will use the “Basic” theme with minimal hacks (see below)
  • I will hijack <h4> … </h4> for my own subheading use
  • I will mark all my paragraphs with <p> … </p> containers
  • I will never use <p> as a separator
  • I will spurn the use of <br> separators
  • I will employ <pre class="code"> … </pre> for code quotes

I’ve modified the “Basic” theme’s default _css file to ensure that “h4” headings have margins which match standard “p” paragraphs, and that preformatted code has a readable font size …

#set( $theme = "basic" )
#parse("/WEB-INF/classes/themes/css.vm")
<style type="text/css" media="print">
.rWeblogCategoryChooser{ visibility: hidden; }
.entries{ width: 100%; float: none; }
.rightbar{ visibility: hidden; }
</style>
<style type="text/css">
h4 { margin:10px; }
pre.code { font-size: 10pt; }
</style>

I didn’t like the way the “Basic” theme’s default _day file adds <p> … </p> containers around each day entry because this forces nesting of the “p” tags within the final HTML. Nor did I like the way the entry title was just marked with the “b” tag. So this is what I now use …

<div class="box">
<div class="entry">
#showDayPermalink( $day )
#showEntryDate( $day )
</div>
#foreach( $entry in $entries )
<a name="$utilities.encode($entry.anchor)" id="$utilities.encode($entry.anchor)"></a>
<h3>$entry.title</h3>
#showEntryText($entry)
<span class="dateStamp">(#showTimestamp($entry.pubTime))</span>
#showEntryPermalink( $entry )
#showCommentsPageLink( $entry )
#end
#showLinkbacks( $day )
</div>

For such an advanced blogging environment I find it absolutely incredible that I have to edit both the bookmarks database and the main weblog template just to get rid of the default (and useless) “News” links! Here’s my revised “Basic” theme weblog template …

<div class="entries">
<h1>#showWebsiteTitle()</h1>
<h2>#showWebsiteDescription()</h2>
#showWeblogCategoryChooser()
#showNextPreviousLinks()
#showWeblogEntries("_day" 15)
</div>
<div class="rightbar">
<h2>Calendar</h2>
<div class="sidebar">
#showWeblogCalendar()
</div>
<h2>RSS Feeds</h2>
<div class="sidebar">
#showRSSBadge()<br>
#showRSSLinks()
</div>
<h2>Search</h2>
<div class="sidebar">
#showSearchForm()
</div>
<h2>Links</h2>
<div class="sidebar">
#showBookmarks("My Sites" true false)
#showBookmarks("Blogroll" true false)
</div>
<h2>Navigation</h2>
<div class="sidebar">
#showCssNavBar()
</div>
</div>

Well, there you are. That’s what I did. I just wish it was a simple matter for others to use the above, but unless you are also using the “Basic” theme it almost certainly won’t work for you! However, the principles should be applicable to other themes.

I have also taken the precaution of downloading the current Roller distro in case someone fiddles with the “Basic” theme in a later release. Sigh!

Caring For The Environment – Making getenv() Scale

Although a relatively minor contributor to OpenSolaris, I still have the satisfaction of knowing that every Solaris 10 process is using my code. But who in their right mind needs getenv(3C) to scale? Of course, if you don’t care about thread safety (as is currently the case with glibc version 2.3.5 — and hence with Linux) your implementation might scale very nicely, thank you!

Sun, on the other hand, does care about thread safety (and we’ve been doing so for a long time). However, we had rather assumed that no one in their right mind would need getenv() to scale, so our implementation was made thread safe by putting a dirty great mutex lock around every piece of code which manipulates environ. After all, as our very own Roger Faulkner is so fond of saying: “Correctness is a constraint; performance is a goal”. And who cares about getenv() performance anyway?

But Who Really Cares?

Well, it turns out that there are some significant applications which depend on getenv() scalability (and which scale wonderfully on Linux … where thread safety is often ignored … they are just very lucky that no one seems to be updating environ whilst anyone else is reading it). So Bart Smaalders filed bug 4991763 “getenv doesn’t scale” and said he thought it was an excellent opportunity for my first putback. Thanks Bart!

For some time I’ve been saying: “If Linux is faster, it’s a Solaris bug!” but somehow 4991763 didn’t quite fit the bill. Firstly, I think an application which depends on getenv() scalability is broken. Secondly, Linux is just feeling lucky, punk. However, I do firmly believe that we should do all we can to ensure that Solaris runs applications well — even those which really need some tuning of their own. I had also been itching for a chance to explore opportunities for lockless optimisations in libc, so all in all 4991763 was an opportunity not to be missed!

A Complete Rewrite

The existing implementation of getenv(), putenv(), setenv(), and unsetenv() was split across three files (getenv.c, putenv.c and nlspath_checks.c) with the global mutex libc_environ_lock being defined elsewhere. Things had become fairly messy and inefficient so I decided on a complete rewrite.

NLSPATH security checks had introduced yet another global variable and a rather inefficient dance involving a mutex on every {get,put,set,unset}env() call just to ensure that clean_env() was called the first time in. In this instance it was an easy matter to remove the mutex from the fast path by doing a lazy check thus:

static mutex_t                  update_lock = DEFAULTMUTEX;
static int                      initenv_done = 0;

char *getenv(const char *name)
{
    char                    *value;

    if (!initenv_done)
        initenv();
    if (findenv(environ, name, 1, &value) != NULL)
        return (value);
    return (NULL);
}

The test was then repeated under the protection of the mutex in initenv() thus:

extern void                     clean_env();
static void
initenv()
{
    lmutex_lock(&update_lock);
    if (!initenv_done || ... ) {
        /* Call the NLSPATH janitor in. */
        clean_env();
        .
        .
        .
        initenv_done = 1;
    }
    lmutex_unlock(&update_lock);
}

By rolling putenv.c into getenv.c I was able to eliminate the use of globals altogether, which in turn allowed the compiler to produce better optimised code.

Look, No Locks!

But the biggest part of the rewrite was to make the fast path of getenv() entirely lockless. What is not apparent from the fragment above is that findenv() takes no locks at all.

Various standards define the global environment list pointer:

extern char **environ;

This has to be kept as a consistent NULL terminated array of pointers to NULL terminated strings. However, the standards say nothing about how updates are to be synchronised. More recent standards forbid direct updates to environ itself if getenv() and friends are being used.

Yet the requirement that environ is kept consistent is precisely what we need to implement a lockless findenv(). The big issue is that whenever the environ list is updated, anyone else in the process of scanning it must not see an old value which has been removed, or miss a value which has not been removed.
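To make this concrete, here is a minimal sketch of what a lockless lookup can look like. The name and signature are illustrative (the real libc findenv() differs); the only thing the scan relies on is the invariant that whichever snapshot of the list it picks up is a consistent NULL terminated array of “NAME=value” strings:

```c
#include <stddef.h>
#include <string.h>

/*
 * A sketch of a lockless environment lookup.  No lock is taken while
 * scanning: the array snapshot is guaranteed consistent by the writers.
 */
static char *
findenv_sketch(char **env, const char *name)
{
    size_t len = strlen(name);
    char **p, *s;

    if (env == NULL)
        return (NULL);
    for (p = env; (s = *p) != NULL; p++) {
        /* Match "NAME=" exactly, then return the value part. */
        if (strncmp(s, name, len) == 0 && s[len] == '=')
            return (s + len + 1);
    }
    return (NULL);
}
```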

The traditional approach is to allocate an array of pointers, with environ pointing to the first element. When someone needs to add a new value to the environment list we simply add it to the end of the list. But how do we go about deleting values? And what if we need to add a new value when the allocated array is already full? If you care about threads, it’s not long before you need to introduce some locking!

The new implementation contains two “smarts” which meet these challenges without introducing locks into the getenv() path …

Double It And Drop The Old One On The Floor

When the new implementation needs a bigger environ list, it simply allocates a new one which is twice as large and copies the old list into it. The old list is never reused — it is left intact for the benefit of anyone else who might happen to be still traversing it.

This may sound wasteful, but the environment list rarely needs to be resized. The wastage is also bounded: the abandoned lists halve in size as you go back, so together they occupy less space than the current list itself, and it is quite easy to prove that this strategy never consumes more than 3x the space required by an array of optimal size.

However, one teeny weeny issue with the “drop it on the floor” approach is that leak detectors can get a tad upset if they find allocated memory which is not referenced by anyone. With a view to keeping me on the straight and narrow — but mostly to avert high customer support call volumes — Chris Gerhard recommended that I keep a linked list of all internally dropped memory (just to keep those goody-goody leak detectors happy).
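One way to keep the abandoned arrays reachable is to allocate each array with one hidden slot in front of the visible list, and use that slot to link back to the previously abandoned allocation. The function name and layout below are my own invention, not the actual libc code, but they show the idea:

```c
#include <stdlib.h>

/*
 * A sketch of a leak-detector-friendly allocator for environment
 * arrays.  raw[0] is a hidden link to the previously abandoned buffer,
 * so every dropped array remains referenced and is never "leaked".
 */
static char **
alloc_env_array(size_t slots, char **old)
{
    char **raw = malloc((slots + 1) * sizeof (char *));

    if (raw == NULL)
        return (NULL);
    /* Hidden link slot: chain to the old raw buffer (if any). */
    raw[0] = (old == NULL) ? NULL : (char *)&old[-1];
    return (&raw[1]);       /* visible array begins after the link */
}
```

Note that the old array itself is never touched, so any thread still traversing it remains safe.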

I first met Chris in 1989. He was on my interview panel when I joined Sun. I do hope he feels he did a good job that day!

Overwrite With The First Entry And Increment

I was bouncing some other getenv() ideas around with Chris when he also gave me just what I needed for deletes. The old code just grabbed the global lock, found the element to be deleted, shifted the rest of the list down one slot (overwriting the victim), and then released the lock.

Chris had the bright idea of copying the first element in the list over the victim, and then incrementing the environ pointer itself. The worst case would be that the same element might be seen twice by another thread, but this is not a correctness issue.

This led to two further changes:

  1. New values are now added at the bottom of the environment list
    (with the environ pointer being decremented once the new value is in place).
  2. When a new double-sized environment list needs to be allocated, the
    old one is copied into the top of the new one (instead of the bottom) so that the list can then be grown downwards.
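Putting the two update rules together, here is a sketch (my names, not the libc internals; the membar_producer() calls discussed later are shown as comments so the fragment stays self-contained). env_list stands in for the global environ, pointing into the middle of a larger array so that the list can grow downwards:

```c
#include <string.h>

static char **env_list;     /* stand-in for the global environ */

/* Add: fill the slot below the list, then publish by decrementing. */
static void
env_add(char *string)
{
    env_list[-1] = string;
    /* membar_producer() goes here on a weakly ordered machine */
    env_list--;
}

/* Delete: copy the first entry over the victim, then shrink from the top. */
static void
env_delete(char **victim)
{
    *victim = env_list[0];
    /* a concurrent reader may see the first entry twice, never a gap */
    env_list++;
}
```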

OK, Not Entirely Lockless

Obviously mutex lock protection is still needed to serialise all updates to the environment list. The new implementation has a lock for all seasons: update_lock (e.g. for updating initenv_done and for protecting environ itself). However the new getenv() is entirely lockless once clean_env() has been called.

Another important issue is that it is considered bad practice for system libraries to hold a lock while calling malloc(). For this reason the first two thirds of addtoenv() are inside a for(;;) loop. If it is necessary to allocate a larger environment array, addtoenv() drops update_lock temporarily. However, this opens a window for another thread to modify environ in a way that forces a retry. The loop is controlled by a simple generation number, environ_gen (also protected by update_lock).
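The shape of that retry loop can be sketched as follows. The names update_lock and environ_gen follow the text, but everything else here is illustrative: the locking is stubbed out so the fragment stays self-contained (real libc uses lmutex_lock()/lmutex_unlock()), and the list is a plain grow-upwards array for simplicity:

```c
#include <stdlib.h>
#include <string.h>

static int environ_gen;                 /* bumped on every update */

static void lock_env(void)   { /* lmutex_lock(&update_lock) in libc */ }
static void unlock_env(void) { /* lmutex_unlock(&update_lock) */ }

/* Sketch of addtoenv()'s drop-the-lock-around-malloc() retry loop. */
static int
addtoenv_sketch(char ***listp, size_t *sizep, size_t used, char *string)
{
    for (;;) {
        int my_gen;
        char **new;
        size_t newsize;

        lock_env();
        if (used < *sizep) {            /* room: insert and publish */
            (*listp)[used] = string;
            environ_gen++;
            unlock_env();
            return (0);
        }
        my_gen = environ_gen;
        unlock_env();                   /* never hold the lock over malloc() */

        newsize = (*sizep == 0) ? 4 : *sizep * 2;
        new = malloc(newsize * sizeof (char *));
        if (new == NULL)
            return (-1);

        lock_env();
        if (environ_gen != my_gen) {    /* lost a race: retry from the top */
            unlock_env();
            free(new);
            continue;
        }
        if (*sizep != 0)
            (void) memcpy(new, *listp, *sizep * sizeof (char *));
        *listp = new;                   /* old array is simply abandoned */
        *sizep = newsize;
        unlock_env();
    }
}
```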

Guaranteeing Consistency

That’s almost all there is to it. However in multiprocessor systems we still have to make sure that memory writes on one CPU happen in such a way that they don’t appear out of sequence on another CPU. Of course this is taken care of automatically when we use mutex locks.

Consider the following code fragment to insert a new item:

environ[-1] = string;
environ--;

It is vitally important that the two memory writes implied by this are seen in the same order by every CPU in the system. On SPARC today this doesn’t matter, since all ABI-compliant binaries run in Total Store Order mode (i.e. stores are guaranteed to become visible in the order in which they are issued). But it is possible that future systems will use a more relaxed memory model.

However, this is not just a hardware issue, it is also a compiler issue. Without extra care the compiler might reorder the two stores, since the C language cares nothing for threads. I had quite a long discussion with a number of folk concerning “sequence points” and the use of volatile in C.

The eventual solution was this:

environ[-1] = string;
membar_producer();
environ--;

First, the function membar_producer() serves as a sequence point, guaranteeing that the C compiler will preserve the order of the preceding and following statements. Secondly, it provides any store barriers needed by the underlying hardware to guarantee the same effect as Total Store Order for the preceding and following instructions.

A Spirit Of Generosity

My new implementation was integrated into s10_67, yet despite my own extensive testing it caused a number of test suites to fail further down the line. This was tracked down to common test suite library code which updated environ directly. Yuck! Although this kind of thing is very much frowned upon by the more recent standards, it was felt that if our own test cases did it, there was a good chance that some real applications did it too. So with some reluctance I filed 6183277 getenv should be more gracious to broken apps.

If someone else is going to mess with environ under your nose there’s not a lot you can do about it. However, it is fairly easy to detect the majority of cases (except for a few really nasty data races) by keeping a private copy, my_environ, which can be compared with the actual value of environ. If these two values are ever found to differ we just go back to square one and try again.

So the above fragment for adding an item now looks more like this:

if (my_environ != environ || ...)
    initenv();
    .
    .
    .
my_environ[-1] = string;
membar_producer();
my_environ--;
environ = my_environ;

Conclusion

My second putback integrated into s10_71. Following this I had to fight off one other challenge from Rod Evans who filed 6178667 ldd list unexpected (file not found) in x86 environment. However this turned out to be not my fault, but a latent buffer overrun in ldd(1) exposed by different heap usage patterns. Of course, the engineer who introduced this heinous bug (back in January 2001) will remain anonymous (sort of). Still, he did buy me a beer!

Of course, the serious point is that when you change something which is used by everyone, it is possible that you expose problems elsewhere. It is to be expected that the thing last changed will get the blame. Such changes are not for the faint-hearted, but they can be a whole lot of fun!

My first experience of modifying Solaris actually resulted in two putbacks into the Solaris gate, but I learnt a great deal along the way. Dizzy with my success, I am now actively seeking other opportunities for lockless optimisations in libc!
