It is fairly well known that multiple threads in multiple processes can wait for TCP/IP connections on the same listening socket. What is not so well known are the nuances of implementation which affect the load balancing of such connections across processes and threads. Here I’m going to share what I learned whilst working on improving the scalability of a popular 3rd party software product (which shall remain nameless).
Let’s not worry too much about how multiple processes come to be sharing the same listening socket (this could easily be achieved by inheritance across fork(2) or using ioctl(2) with I_SENDFD (see streamio(7I))). Instead, let’s consider how connections are allocated to threads in such a situation.
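For completeness, though, a minimal sketch of the fork(2) flavour might look like the following. This is purely illustrative and not taken from the product in question; the port number and child count are arbitrary, and error handling is omitted:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct sockaddr_in sin;
        int lfd, cfd, i;

        /* one listening socket, created before any fork(2) */
        lfd = socket(AF_INET, SOCK_STREAM, 0);
        (void) memset(&sin, 0, sizeof (sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(8080);               /* arbitrary port */
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        (void) bind(lfd, (struct sockaddr *)&sin, sizeof (sin));
        (void) listen(lfd, 128);

        for (i = 0; i < 4; i++) {
                if (fork() == 0) {
                        /* child: lfd was inherited across fork(2) */
                        for (;;) {
                                cfd = accept(lfd, NULL, NULL);
                                if (cfd >= 0) {
                                        /* ... service the connection ... */
                                        (void) close(cfd);
                                }
                        }
                }
        }
        for (;;)
                (void) pause();                   /* parent just waits */
        /* NOTREACHED */
        return (0);
}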
The Relatively Easy Way
Threads blocking in accept(3SOCKET) join a socket-specific queue in “prioritised FIFO” order. This is almost exactly what we want for round-robin allocation of connections!
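As a sketch of what this looks like in code (again illustrative; spawn_workers() and the thread count are names and values of my choosing, and a listening socket is assumed to have been set up already), a worker pool is simply a handful of threads all blocked in accept() on the same descriptor, with the socket’s sleep queue deciding who gets the next connection:

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static int lfd;                         /* the shared listening socket */

/*
 * Each worker blocks in accept(3SOCKET); the socket's sleep queue
 * decides which thread is handed the next connection.
 */
/*ARGSUSED*/
static void *
worker(void *arg)
{
        int cfd;

        for (;;) {
                cfd = accept(lfd, NULL, NULL);
                if (cfd >= 0) {
                        /* ... service the connection ... */
                        (void) close(cfd);
                }
        }
        /* NOTREACHED */
        return (NULL);
}

void
spawn_workers(int fd, int nthreads)
{
        pthread_t tid;
        int i;

        lfd = fd;
        for (i = 0; i < nthreads; i++)
                (void) pthread_create(&tid, NULL, worker, NULL);
}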
However, it does pose a challenge when a thread “just happens” to have a low priority when it arrives at the queue. If having a low priority is a relatively rare occurrence, threads thus afflicted could suffer starvation (i.e. if there are always one or more higher priority threads queued whenever a connection arrives).
In the olden days this was a real issue for threads in the TS scheduling class. Of course, one solution would be to use a fixed priority scheduling class (such as FX).
However, what was needed was for threads sleeping in “prioritised FIFO” queues to have their priority and queue position recalculated on a regular basis (in the same manner as runnable threads waiting for a CPU).
The bug 4468181 (“low priority TS threads on a sleep queue can be victimised”) was first fixed in Solaris 9. The fix has since been made available for Solaris 8, Solaris 7, and Solaris 2.6 via patches, but in these cases it has to be enabled by adding the following to /etc/system:
set TS:ts_sleep_promote=1
The fix is enabled by default in Solaris 9 onwards, and once it is in place connections to a specific listening socket will be allocated to threads on a near round-robin basis.
The Extremely Hard Way
For whatever reason (“it seems to work ok on AIX” is not a good one!) some folk like to put their listening sockets into non-blocking mode, and then to wait for connections in select(3C) or poll(2) — it is important to note that in Solaris, select() is implemented using poll() in any case.
This is all fine and dandy, except that Solaris uses LIFO for the poll sleep queues. Whilst this suits thread worker pools very well (because it helps keep caches warm) it is precisely not what one needs for the round-robin load-balancing of connections between threads.
Of course, the “prioritised FIFO” queueing of accept() and the LIFO queueing of poll() are either not specified or are classified as “implementation specific” by the standards. Solaris does what it does because it seems to please most of the people most of the time. However, if you assume FIFO you can be in for a nasty surprise!
Actually, having multiple threads waiting in poll() for connections on the same listening socket is a pretty dumb idea. The reason is that a thundering herd will occur each time a connection comes in.
Some implementations actually result in every thread coming right out of poll() and diving into accept() where only one will succeed. Solaris is a little smarter in that once one thread has picked up a connection, other threads still in poll() will go back to sleep.
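For reference, the pattern being criticised looks something like the following sketch (again illustrative rather than lifted from any real application): the listening socket has been put into non-blocking mode, and every worker thread waits in poll(2) before racing into accept():

#include <sys/socket.h>
#include <poll.h>
#include <unistd.h>

/*
 * Each worker waits in poll(2) on a non-blocking listening socket and
 * then races the other woken threads into accept(); the losers get
 * EWOULDBLOCK and go back round the loop.
 */
static void *
poll_worker(void *arg)
{
        int lfd = *(int *)arg;          /* assumed to be O_NONBLOCK already */
        struct pollfd pfd;
        int cfd;

        pfd.fd = lfd;
        pfd.events = POLLIN;

        for (;;) {
                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                cfd = accept(lfd, NULL, NULL);
                if (cfd < 0)
                        continue;       /* another thread won the race */
                /* ... service the connection ... */
                (void) close(cfd);
        }
        /* NOTREACHED */
        return (NULL);
}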
Desperate Situations, Desperate Measures
If you have an application which is doing things the extremely hard way, then the best solution is a complete rewrite! However, if this is not an option open to you, “The Harman House Of Hacks” is proud to present the following:
#define _REENTRANT
#include <sys/types.h>
#include <sys/socket.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <poll.h>

static int (*orig_accept)();

#pragma init(iaccept_init)
#pragma weak accept=_accept

/* resolve the real accept() at load time */
static void
iaccept_init()
{
        orig_accept = (int(*)())dlsym(RTLD_NEXT, "_accept");
}

/* sleep for a random 0-20ms before calling the real accept() */
static int
_accept(int s, struct sockaddr *addr, socklen_t *addrlen)
{
        int res;

        (void) poll(0, 0, drand48() * 21.0);
        res = orig_accept(s, addr, addrlen);
        return res;
}
This is intended to be compiled and used something like this:
$ cc -K pic -G -o iaccept.so iaccept.c
$ LD_PRELOAD=./iaccept.so brokenapp
This works by interposing a new implementation of accept() which does a short random sleep before diving into the standard implementation of accept(). This means that the thread at the head of the listening socket’s sleep queue is no longer likely to be the first thread into accept().
Conclusion
Of course, this is a rather blunt instrument which should not be applied universally — or lightly. However, although more threads in the thundering herd are likely to come all the way out of poll(), this hack does a very nice job of scrambling the sleep queue order. For the application for which the above hack was written, the latency added to accept() was not an issue. But more importantly, connections were very evenly balanced across processes.
This reminds me of a problem I dealt with about 7 years ago on Trusted Solaris 1.2 (SunOS 4.1.3_U1 based). As you well know, SunOS wasn’t threaded, and many applications made assumptions based on that. A web server product, the distant child of which Sun now owns, thought it was a good idea to fork off multiple children and have them all dive into accept() on the same fd.
It worked fine on SunOS 4.1.3, however this played merry hell with things on Trusted Solaris 1.2. Why? Well, unlike base SunOS, its Trusted Solaris sibling sometimes couldn’t complete the accept() without coming out of the kernel and having the tnetd process do some CIPSO/macsix mapping of the labels and then diving back into the kernel again. I’ve forgotten most of the rest of the details, but it resulted in the accept() failing and returning with EAGAIN, which the app wasn’t expecting. IIRC the result was that all of the app processes failed somehow after this.
I’m pretty sure I documented all this in a bug somewhere but I couldn’t find it with a quick search.
This blog exactly describes the behavior of lighttpd and one of the challenges I was having in making it go faster without too much surgery. The right thing would be for it to use event ports, but that’s not currently how it works.
Anyway, this interposer delivered quite a bump in throughput with this workload. Thanks!
Cool! Happy to help 🙂