Fun replicating the Sun Unified Storage Simulator on my Mac

I was there when they launched the Sun Storage 7000 Unified Storage Systems (“there” being the CEC conference in Las Vegas). Within an hour of the launch I had downloaded the Sun Unified Storage Simulator for VMware Fusion and started playing with this ultra-cool software stack on my MacBook Pro.

Later in the evening, Bryan Cantrill and Mike Shapiro, the geniuses behind these amazing new products, issued a challenge for someone to be the first to set up a pair of laptops with a fully replicated virtualised service.

Not being one to pass up such challenges lightly, I nonetheless thought I would take it a stage further with an attempt at setting up replicated servers on just one laptop (because that’s all I had with me).

I had hoped to write a long and detailed tale of how I laboured hard and long (against all the odds, etc.) to bring this amazing feat to pass, but in the event it was all just too easy. In fact, it has taken longer to write this blog than it took to do the work!

I used Fusion’s host-only networking to set up a private 10.0.0.0/24 network between the two virtual machines. Everything I tried “just worked” and was so intuitive that there was no need to read any documentation.
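
For the record, the Mac side of the demo was just an NFS mount from the Terminal, something like this (the appliance’s address on the host-only network and the share’s export path are assumptions for illustration; the web interface shows the actual export path for each share):

$ sudo mkdir /private/nfs
$ sudo mount -t nfs 10.0.0.2:/export/shares/default /private/nfs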

Having demonstrated my solution to Mike, I took a series of three screen shots of my Mac desktop to illustrate the replication in progress for this blog. Each includes VMware Fusion consoles for both servers, the web management interface of each, and a local Terminal window showing the Mac being an NFS client of the primary server.

The first screen shot shows the replica in sync with the primary server, with each seeing 20.2MB in the default share.

The second shows a dd(1) command having just created a 10MB file, with 5.1MB (i.e. 25.3MB – 20.2MB) having made it to disk already on the primary server (but no change on the replica).

The final shot, taken less than a minute later, shows 30.2MB on both servers, the replication of the new 10MB file having just completed.
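
For completeness, the 10MB file in the second screen shot was created from the Mac’s Terminal with something along these lines (the mount point and file name are again just for illustration):

$ dd if=/dev/zero of=/private/nfs/testfile bs=1m count=10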

How cool is that?!

Niagara 2 memory throughput according to libMicro

libMicro is a portable, scalable microbenchmarking framework which Bart Smaalders and I put together a little while back. It is available to the world via the OpenSolaris website, under the CDDL license, so there really is nothing to stop you recreating the data in this posting.

Although designed for testing individual APIs, the libMicro framework has proven useful for other investigations. For instance, the memrand case, which does negative stride pointer chasing, can be configured to test processor cache and memory latencies…

huron$ bin/memrand -s 128m -B 1000000 -C 10
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15232           12        0  1000000 134217728
huron$

The above shows 12 samples (we asked for at least 10) of 1,000,000 negative stride pointer references striped across 128MB of memory. The platform is a Sun SPARC Enterprise T5220 server, with an UltraSPARC T2 processor running at 1.4GHz. This simple test indicates a memory read latency of 152ns.

But we can also use libMicro’s multiprocess and multithread scaling capabilities to extend this test to measure memory throughput scaling…

huron$ for i in 1 2 4 8 16 32 64; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15223           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   2      0.15176           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   4      0.15208           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   8      0.25472           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  16      0.26242           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  32      0.24964           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  64      0.24063           12        0  1000000 134217728
huron$

This shows that up to 4 concurrent threads see 152ns latency, with 64 threads (i.e. full processor utilisation) seeing 240ns latency, which equates to a throughput of 267 million memory reads per second (i.e. 64 / 0.240e-6). Just to set this in context, here are some data for a quad socket Tigerton system running at 2.93GHz…

tiger$ for i in 1 2 4 8 16; do bin/memrand -s 128m -B 1000000 -C 10 -T $i; done
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   1      0.15559           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   2      0.15621           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   4      0.15667           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1   8      0.17726           12        0  1000000 134217728
prc thr   usecs/call      samples   errors cnt/samp     size
memrand       1  16      0.18654           12        0  1000000 134217728
tiger$

This shows a peak throughput of about 86 million memory reads per second (i.e. 16 / 0.186e-6), making the single chip UltraSPARC T2 processor’s throughput 3x that of its quad chip rival. Of course, mileage will vary greatly from workload to workload, but pretty impressive nonetheless, eh?
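
For anyone who wants to check the arithmetic, bc(1) makes short work of it (the latencies are in microseconds, so dividing the thread count by the latency gives millions of reads per second):

$ echo "64 / 0.240" | bc -l
266.66666666666666666666
$ echo "16 / 0.186" | bc -l
86.02150537634408602150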

Now what’s the chance of that?

“Congratulations, you have been randomly selected to win a free all accommodation included vacation to Florida Bahamas. Press 9 for more information.”

How lucky am I?!

I received this automated call twice within ten minutes to two different phone numbers! How random is that?!

So what possessed me to hang up? Well, if you have any sense, you’ll do the same. If these people can’t even be honest about the way they choose your number, why should you trust them with anything else?

Moral: If something appears too good to be true, it probably is.

Anyway, must dash as I’ve got to book my tickets for Nigeria. A very friendly lady who is the widow of some ex quasi government official needs my help laundering a few million dollars…

Taking UFS new places safely with ZFS zvols


I’ve just read a couple of intriguing posts which discuss the possibility of hosting UFS filesystems on ZFS zvols. I mean, who in their right mind…? The story goes something like this (a small zvol of, say, 1GB is plenty for the demonstration) …

# zfs create -V 1g tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# touch /ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# ls -l /ufs_clone/file

Whoopy doo. It just works. How cool is that? I can have the best of both worlds (e.g. UFS quotas with ZFS datapath protection and snapshots). I can have my cake and eat it!

Well, not quite. Consider this variation on the theme:

# zfs create -V 1g tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# zfs snapshot tank/ufs@snap
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file

What will the output of the cat(1) command be?

Well, every time I’ve tried it so far, the file exists, but it contains nothing.

The reason for this is that whilst the UFS metadata gets updated immediately (ensuring that the file is created), the file’s data has to wait a while in the Solaris page cache until the fsflush daemon initiates a write back to the storage device (a zvol in this case).

By default, fsflush will attempt to cover the entire page cache within 30 seconds. However, if the system is busy, or has lots of RAM — or both — it can take much longer for the file’s data to hit the storage device.

Applications that care about data integrity across power outages and crashes don’t rely on fsflush to do their dirty (page) work for them. Instead, they tend to use raw I/O interfaces, or fcntl(2) flags such as O_SYNC and O_DSYNC, or APIs such as fsync(3C), fdatasync(3RT) and msync(3C).

On systems with large amounts of RAM, the fsflush daemon can consume inordinate amounts of CPU. It is not uncommon to see a whole CPU pegged just scanning the page cache for dirty pages. In configurations where applications take care of their own write flushing, it is considered good practice to throttle fsflush with the /etc/system parameters autoup and tune_t_fsflushr. Many systems are configured for fsflush to take at least 5 minutes to scan the whole of the page cache.
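
As a sketch (the values here are illustrative rather than a recommendation for any particular system), settings like these in /etc/system give fsflush a 300 second coverage target, waking every 5 seconds to scan roughly 1/60th of the page cache per pass (a reboot is needed for /etc/system changes to take effect):

* throttle fsflush: cover the whole page cache in 300s, waking every 5s
set autoup=300
set tune_t_fsflushr=5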

From this it is clear that we need to take a little more care before taking a snapshot of a UFS filesystem hosted on a ZFS zvol. Fortunately, Solaris has just what we need:

# zfs create -V 1g tank/ufs
# newfs /dev/zvol/rdsk/tank/ufs
# mount /dev/zvol/dsk/tank/ufs /ufs
# date >/ufs/file
# lockfs -wf /ufs
# zfs snapshot tank/ufs@snap
# lockfs -u /ufs
# zfs clone tank/ufs@snap tank/ufs_clone
# mount /dev/zvol/dsk/tank/ufs_clone /ufs_clone
# cat /ufs_clone/file

Notice the addition of just two lockfs(1M) commands. The first blocks any writers to the filesystem and causes all dirty pages associated with the filesystem to be flushed to the storage device. The second releases any blocked writers once the snapshot has been cleanly taken.

Of course, this will be nothing like as quick as the initial example, but at least it will guarantee that you get all the data you are expecting. It’s not just missing data we should be concerned about, but also stale data (which is much harder to detect).

I suppose this may be a useful workaround for folk waiting for some darling features to appear in ZFS. However, don’t forget that “there’s no such thing as a free lunch”! For instance, hosting UFS on ZFS zvols will result in the double caching of filesystem pages in RAM. Of course, as a SUNW^H^H^H^HJAVA stock holder, I’d like to encourage you to do just that!

Solaris is a wonderfully well-stocked tool box full of the great technology that is ideal for solving many real world problems. One of the joys of UNIX is that there is usually more than one way to tackle a problem. But hey, be careful out there! Make sure you do a good job, and please don’t blame the tools when you screw up. A good rope is very useful. Just don’t hang yourself!


This global economy thing is scary!

My current electronics hobby project is to build some ADAT “Lightpipe” repeaters and patch panels for my home studio (yes, there are commercial products out there, but they are very pricey for what they do).

ADAT uses the same plastic fibre optic “TOSLINK” technology found on many domestic CD/DVD players, games consoles, AV processors etc, but carries up to 8x 48kHz (or 4x 96kHz) 24-bit PCM signals (i.e. a higher bandwidth requirement than traditional S/PDIF or Dolby 5.1 applications).

Generally, I use RS Components for such projects, but I’ve had real problems sourcing the TOSLINK transmitter and receiver parts in the UK. They are only made by Toshiba and Sharp, and despite being ubiquitous in domestic audio appliances are very hard/expensive for hobbyists to find at a reasonable price.

Googling around revealed one enthusiast who used to sell them for £6 a piece, and another UK online supplier currently quoting over £17 per device. At those prices, one might as well buy the commercial product!

RS will source non-catalogue items for account holders, but they require a £200 minimum order (plus they quote 10+ days for delivery). The big distributor EBV don’t sell to the little guy at all (they require a trade account, and a minimum order of £1000).

Enter Digi-Key. They’re based in the USA, but have localised online ordering for the UK and many other countries. About 36 hours ago I placed an order for 20x TOTX147L and 20x TORX147L parts (3V, shuttered, 15Mbps transmitter and receiver). The price was about £1 each, plus £10 handling because my order was under £75, and £12 shipping because it was under £100.

The UPS man knocked on the door a couple of hours ago, collecting the £12 VAT due on delivery. Amazing! 40 parts for just £74. But what really blew my socks off was that it arrived about half an hour before the RS order I placed the same evening (for all the other bits I need for my project).

Solaris Threads Tunables – Part 1


It is my intention to make this a series of short postings, each looking at one or two specific tunables. Hopefully, this “Part 1” will be followed by a “Part 2” and a “Part 3” and so on — until I run out of interesting things to say on the subject. But I’m not the most prolific blogger, so don’t hold your breath!

Read The Paper

I suffer from blogger’s block. Part of my problem is gauging the amount of context which needs filling out before I can get to the meat of what I want to say. For this short series of posts I am not going to go into lots of detail about Solaris multithreading architecture. In Solaris 8 we began a U-turn on our famous MxN application threads architecture when we introduced the alternate 1:1 implementation. In Solaris 9 we dropped the MxN implementation altogether, and the thankless task of explaining why this was a good thing fell to me. I’m going to assume that you’ve read Multithreading in the Solaris Operating Environment, most of which still applies in Solaris 10 and OpenSolaris (we’ve just made it even better; see http://opensolaris.org).

Heed The Warnings


“It betrayed Isildur to his death. And some things that should not have been forgotten were lost. History became legend, legend became myth, and for two and a half thousand years the Ring passed out of all knowledge. Until when chance came, it ensnared a new bearer. The Ring came to the creature Gollum, who took it deep into the tunnels of the Misty Mountains. And there, it consumed him. The Ring brought to Gollum unnatural long life. For five hundred years it poisoned his mind. And in the gloom of Gollum’s cave, it waited. Darkness crept back into the forest of the world. Rumor grew of a shadow in the East, whispers of a nameless fear, and the Ring of Power perceived its time had now come.” — J R R Tolkien, The Lord Of The Rings

The stuff I’m going to mention used to be hidden and was intended solely for our own use — not yours. We believe Solaris should perform well out of the box, and that the provision of tunables amounts to an admission of failure. We have never formally published or documented these things before because: we don’t think you really need them; we don’t want them to become de facto standards; and because we need the freedom to remove or change them in future. In many cases tweaking these tunables will actually hurt application performance. And they are a very blunt instrument, affecting all the threads and synchronisation objects within a process. So although some regions of your code may speed up, other regions may suffer, with no net gain overall.

Please use this information responsibly. It is unlikely ISVs have tested their applications with anything other than the default values — I know we don’t. Be up-front with your suppliers (including us) when dealing with support issues, and make sure you eliminate these tunables early from their inquiries.

So Why Bother?

With the advent of OpenSolaris some things which should have been forgotten are going to be found. Indeed, you will find the names of all the tunables I’m going to mention here. We can’t hide this stuff any more — nor do we want to. With OpenSolaris we say “what’s ours is yours”.

Of course, there may be cases where fiddling with these tunables actually helps application performance. Of course, this may be because the applications themselves are poorly written, but it also may indicate that we have further work to do in improving Solaris performance out of the box. Please do share your data! And sharing this information with your suppliers — especially some of the debugging features — may actually speed the resolution of some support issues.

The Secret’s Out

For now, I’m just going to tell you what the tunables are. You’ll have to wait for subsequent posts (or go grok the source) for an understanding of what they actually do and how they may be useful to you.

In /usr/src/lib/libc/port/threads/thr.c we find these names:

QUEUE_SPIN
ADAPTIVE_SPIN
RELEASE_SPIN
MAX_SPINNERS
QUEUE_FIFO
QUEUE_VERIFY
QUEUE_DUMP
STACK_CACHE
COND_WAIT_DEFER
ERROR_DETECTION
ASYNC_SAFE
DOOR_NORESERVE

The function set_thread_vars() scans the environment variable list looking for these names prefixed by “_THREAD_” or “LIBTHREAD_”. The latter is deprecated now that libthread has been folded into libc, but provides some compatibility back as far as the alternate thread library implementation in Solaris 8 (although there have been some changes along the way which I am not going to document in these postings). The function etest() specifies a maximum value for each variable and ensures that any user-defined value falls between zero and this limit. The default values are defined elsewhere in the code. In OpenSolaris the tunables are defined as follows:

envvar                          limit           default
_THREAD_QUEUE_SPIN              1000000         1000
_THREAD_ADAPTIVE_SPIN           1000000         1000
_THREAD_RELEASE_SPIN*           1000000         500
_THREAD_MAX_SPINNERS            100             100
_THREAD_QUEUE_FIFO              8               4
_THREAD_QUEUE_VERIFY**          1               0
_THREAD_QUEUE_DUMP              1               0
_THREAD_STACK_CACHE             10000           10
_THREAD_COND_WAIT_DEFER         1               0
_THREAD_ERROR_DETECTION         1               0
_THREAD_ASYNC_SAFE              1               0
_THREAD_DOOR_NORESERVE          1               0

All but the last one also exist in Solaris 10 (and for now, the LIBTHREAD_ variants will also work). Making a change is as simple as setting an environment variable (just remember that all envvars are generally inherited by child processes).
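
For example, to experiment with one of these tunables for a single run of a program (the application name and the value are purely illustrative; just keep the value within the limit shown above):

$ _THREAD_ADAPTIVE_SPIN=5000 ./myapp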

I hope this has whetted your appetite for the next post, but in the meantime “hey, be careful out there”!


Three Abstracts Submitted for CEC 2006

The following abstracts were submitted for Sun’s internal Customer Engineering Conference 2006. Of course there is no guarantee that this material will be accepted by the CEC panel, but I’d be happy to present the same (or similar) material at other events. If you’re interested, please drop me a line.

Microbenchmarking – Friend or Foe?

Many purchasing, configuration and development choices are made on the basis of benchmark data. Industry organisations such as SPEC and TPC exist to inject a measure of realism and fairness into the exercise. However, such benchmarks are not for the faint hearted (e.g. they require considerable hardware, software and people resources). Additionally, the customer may feel that an industry-standard benchmark is not sufficiently close to their own perceived requirements. Yet building a bespoke benchmark for a real world application workload is an order of magnitude harder than going with something “off the peg”. It is at this point that an alarming number of customers make the irrational leap to some form of microbenchmarking — whether it is good old “dd” to test an I/O subsystem, or perhaps LMbench’s notion of “context switch latency”. The whole is rarely greater than the sum of its parts, but the issue often ignored is that a microbenchmark — by very definition — only considers one tiny component at a time, and then only covers a small subset of functionality in total. Furthermore, it is often observed that some microbenchmarks are very poor predictors of actual system performance under real world workloads.

Is there any place for microbenchmarking? Certainly, we need to be aware that customers may be conducting ill-advised tests behind closed doors. But should we ever dare engage in such dubious activities ourselves? In short: yes! In the right hands microbenchmarks can highlight components likely to respond well to tuning, and assist in the tuning process itself. This session will focus on libMicro: an in-house, extensible, portable suite of microbenchmarks first used to drive performance improvements in Solaris 10. The libMicro project was driven by the conviction that “If Linux is faster, it’s a Solaris bug”. However, some of the initial data made the case so strongly that at first we chose to adopt the Monsters Inc. slogan “We scare because we care”! libMicro is now available to you and your customers under the CDDL via the OpenSolaris programme. Key components of libMicro will be demonstrated during this session. The demo will include data collection, reporting and the addition of new cases to the suite.

Note: I was taking them seriously about the 2500 character and two paragraph limits.

Synchronicity: Solaris Threads and CoolThreads

The Unified Process Model is one of the best kept secrets in Solaris 10. Yet this “so what?” feature entailed changes to over 1600 source files. But was it all a waste of effort? For over a decade Sun has been recognised as a thought leader in software multithreading, but did we lose the plot when we dropped the idealistic two-level MxN implementation for something much simpler in Solaris 9? To both of these questions we must answer a resounding “No!”. Indeed, the Unified Process Model, under which every process is now potentially a multithreaded process, was only made possible by a simpler, more scalable, more reliable, more maintainable, more realistic one-level 1:1 implementation. And all this goodness just happens to coincide with the CoolThreads revolution. As other vendors chime in with CMT, Solaris is streets ahead of Linux and other platforms in being able to deliver real benefits from this technology. It is extremely important that we are able to understand, articulate and exploit this synchronicity.

Note: this time I realised that they didn’t really mean 2500 chars!

DTrace for Dummies

Wonder what all the fuss is about? Need a good reason before you engage your brain with this stuff? Think this may be one new trick too far for an aging dog? Just curious? Then this session is for you! We have a reputation for making DTrace come alive for even the most skeptical and indifferent of crowds — D is certainly not for “dull” at our shows! Don’t worry, we won’t get you bogged down in syntax or architecture. But we will convince you of the dynamite that is the DTrace observability revolution — that, or you are dumber than we thought! Everything you see will happen live. We don’t use any canned scripts. Anything could happen. You’d be a fool to miss it!

Notes: This was a joint submission from me and Jon Haslam. We’ve found our combination of sound technical content and Brit humour very effective at getting across the DTrace value proposition to a wide audience. We first did our double act (Jon types while Phil talks) at SUPerG 2004. Following rave reviews we were asked to present a plenary session at SUPerG 2005.


Roy Chadowitz, 1955-2005

I started at Sun at the end of February, 1989. I think Roy joined within a month of me. When I made the jump from Customer Services to Sales Support, he was my first manager. I owe him a lot in terms of my personal development. He was always something of an older brother figure and mentor to me. Roy always showed a genuine interest in individuals and their families. He did all he could to make working for Sun more family-friendly. He set a good example of working hard whilst having fun, and without forgetting that there was much more to life than work.

The only criticism I ever heard of Roy was that he was very tight with money. Sometimes he seemed to give the impression that our business expenses, equipment budget and so on came directly out of his personal savings. But he always applied the same control to his own spending — only more so. Even when he was housebound it was hard to get Roy to spend even modest amounts on the equipment he really needed to do his job. I guess the chipboard coffin is standard Jewish practice, but I’m sure Roy would have heartily approved!

But what sticks with me most is Roy’s patience and endurance through so much suffering. He had so many secondary issues to deal with due to his early cancer treatment (I guess they went over the top on that because they didn’t think he had long to live).

I think it is worth telling the story of Roy being given 6 months to live — 10 years ago. I think we need to record his 50th birthday. The story of Roy tripping on a marquee guy rope and then walking around with a broken neck for at least 6 months is remarkable. If I recall correctly, he was in a lot of pain, but they put it down to his cancer. Roy proudly showed me his x-rays (first the beautiful intricate work of the neurosurgeons supporting his broken neck, and then the functional rods inserted by the orthopedic chaps). Later he showed me the x-ray of the electrodes wired into his spine for pain relief (something like a cross between a TENS machine and a pacemaker).

He endured so many operations, so much pain, and so many setbacks. But he also made huge progress and frequently amazed us with his determination to bounce back and continue with his work. I will always remember seeing Roy in “The Directors” line-up at the 2004 kick-off meeting in Nottingham.

The last year was hard to watch. It doesn’t seem long after Nottingham that the sudden paralysis came. It was hard to see Roy brought so low. But he wouldn’t give up fighting, even then. Sometimes we felt he needed to stop work, but it had to be his choice. And as we watched, it seemed clear that Roy knew he would decline very rapidly if he had nothing to do, nothing to aim for any more. Very rarely did we hear any complaint.

It was my privilege to be welcomed into the Chadowitz home on many occasions. Carol always kept me well fed, and Simon was often good company during the long waiting periods which come with certain system admin chores. I remember the excitement around Lee’s Bar Mitzvah, and Roy’s pride in explaining it all to me. What an amazing family! So much practical love. There were dark moments — times when it was clear that Roy was troubled about the burden he had become, but no-one would hear such talk! The love that Roy and Carol shared was an inspiration to see.

I last emailed Roy the day after his 50th birthday (although I didn’t know that at the time), two days before he finally left us. That was the first time I didn’t get a reply. When Greg phoned on the Saturday evening it was the call I had been dreading for years — 10 years. I’d expected it, but somehow Roy’s determination to keep going had lulled me into a false sense of security. It was like hitting a wall.

I had got used to seeing Roy confined to his bed. Even then I’d seen him struggle — and improve! But I’d been away for too many weeks (business trips and heavy colds kept me away). I didn’t see Roy at his lowest. I didn’t see him when he was unable to speak. That would have been intolerable for him. He always had so much to say. So much to give.

His most common parting shot to me was “Keep smiling, Phil”. It’s hard to do that just now, Roy, but I’ll try!


Note: I have closed comments on this entry due to comment SPAM.
Please email me directly if you want to add something further.

blogs.sun.com on BBC Radio 4

I just happened to listen to this: “BBC R4 16:00 Shop Talk – From online consumer blogs to mobile phone pictures being broadcast as TV news; the public are using technology to tell their side of the story. Heather Payton and her guests discuss the impact ‘citizen’s media’ is having on big business.”

Great programme about blogging in general, and the commercial opportunities for blogging in particular. Lots of excellent input from our very own Simon Phipps (great job, Simon)! And we can all give ourselves a hearty pat on the back for being part of such an exciting story 🙂

Look out for the “Listen Again” link via this page: http://www.bbc.co.uk/radio4/news/shoptalk/ (today’s programme should appear later today, and hang around for about a week).

Keep blogging!