All these programs are the Past. Threads are the Future.
Two motivations conspire to squeeze off the possibility of workable unthreaded platforms. What are they?
First, fashion! ``Unthreaded platforms,'' I say...what's a platform, anyway? It's where all the money comes from in selling computers and software. Keep people on the platform, and they can be controlled and milked for cash. Never mind how to do the milking---that solves itself. Just keep them on the platform, and milk...er, cash, flows. Java is a platform. Windows is a platform, but whenever the scum at Microsoft build some new ramshackle asbestos-laden shanty hut onto the side of Windows, the new thing becomes an even stricter platform: Office is definitely a platform. .NET is the biggest sheep-herding exercise in decades. There are either/or platforms, and there are nested platforms where one is built upon the other, but a platform always has this containing feature, so there is major stumbling and rework when you try to move off it. Ever feel like Mac users are a cult? That's platform-building in action. Giant web sites like Amazon, Google Maps, and Facebook that have ``API's'' are trying to become platforms. Platforms == riches.
It's easier to communicate with others if they're on the same platform. You can borrow programs others are running as long as you don't have to cross platforms. Data will be richer if it hasn't crossed platforms. If the ``data'' is an email from your friend telling you how to fix your computer, that won't work as well if the email has crossed platforms, either, which leads right into the next point about platforms.
More importantly, you can borrow/steal developers from others on the same platform---if someone's been stuck on a platform for a long time, if his mind is infected with it, then he'll work more efficiently if you don't try to relocate him. People get stuck on platforms by creating the platform just as much as they do by merely using it. The bigger it becomes, the more there is to relearn, reimplement, adapt, if you try to leave. The platforms of tomorrow are the ones taught in third-world for-profit tech schools. The old Unix tribe is dying out, mostly because we've been outmaneuvered in indoctrinating the next generation of worker drones.
What do threads have to do with platforms? Well, lots of platforms have threads now. Java and Windows do for sure, but also Perl and Python and, well, almost anything new. Without threads, you're kicked off the platform. And we create the threaded uberplatform by expecting to have threads whenever we write new programs---even if we don't have .NET Win_LPCreateThreadEx32(), we want to have some function that makes a familiar thread-like thing, and we whine if we don't get it. Porting programs across platforms, even big ones like Firefox, is doable, but porting a threaded program to an unthreaded platform is impossible. Firefox needs threads on all platforms where it runs.
To be fair, threads are useful. They make some programs simpler to write. But that's not why they're the future. They're the future because all the (fewer every year) politically powerful platforms of the future include them.
Second reason: performance. Our timesharing system can accomplish more computation if it has many runnable contexts to spread among CPU cores instead of just a few. An unthreaded program has only one context, and can keep only one CPU busy. The spreading out of single programs onto many CPU's can take many forms besides threads, but most of the ideas for making computers of the future faster depend on this spreading, and threads are probably the simplest imaginable model. Here are some of the ways having multiple threads can speed up computing:
The physicists and chemists who design CPU-fabricating robots can efficiently make chips up to a certain maximum size, and making smaller chips is only a little bit cheaper. CPU designers have to fill up the chip with something to make it faster, or they'd be wasting space. They tried so many tricks to use up the space: gobs of on-chip L2 cache, silicon instruction-set translators to let us hold onto the ancient and poorly-performing i386 instruction set, almost-never-used SIMD vector units (AltiVec, VIS), system-on-a-chip stuff like memory controllers, PCI controllers, HyperTransport switches, gigabit Ethernet MAC's.
Finally, there was so much extra space designers couldn't figure out any plan no matter how crazy for using up the space, so they copied and pasted their entire design and started giving us two CPU's on one chip whether we like it or not. Now we're stuck trying to think of some way to keep two CPU's busy.
But why do all that? If we have four runnable threads, let the CPU multiplex them. It can work on all four at once, each running at about 1/4 normal speed. There's no need to order instructions cleverly or build assembly lines with complicated interlocks if the CPU can choose which of the four runnable threads to assign to each subunit. Doing so leaves more to chance than the expensive C compiler does, but might still work better.
These new i386, UltraSPARC T<n>, and POWER5 CPU's that accept two to eight contexts per core also give some flexibility in what workload the overall machine will handle performantly. A load that always has many more runnable threads than CPU cores will cause lots of context-switching overhead, and maybe also L2 cache thrashing in a cheap machine, and TLB thrashing unless you have a fancy CPU with ASI's (not i386). Too many runnable threads can introduce overhead. At the other extreme, a load with fewer runnable threads than CPU cores will leave CPU's idle, wasting the performance they'd contribute to the overall job. Ordinary large, multi-CPU machines, ones that take a single thread context per CPU core, compute efficiently when they have a number of runnable threads between the two extremes. By making a CPU core into something that accepts 1-8 threads, we may expand the zone of load average in which the machine is efficient.
So, this strategy is somewhat self-validating: it makes computation more efficient iff one can feed it a threaded workload. It also maybe does a better job than older designs if you're stuck with a heavily-threaded workload whether you like it or not.
Anyway, the large clusters that are working and scalable right now AFAIK use separate processes on each node, not threads. I'm not a big cluster fanboy, so I could be wrong. It's hard to be right because I don't know of a cluster test-suite that exercises all the typical weak spots of a cluster and then gives some comparable numerical result. Instead, the physical design of the cluster defines the API you have to use, and the computation has to be written to this API, so large clusters aren't interchangeable and thus aren't comparable, and no standard test can be written. The results they always quote are ones which leave the interconnect completely empty and exercise no weak points at all. What threads require---cluster-shared memory---is typically the most glaringly gigantic weak point of every cluster, the feature that's been most elusive, that if you try to use it will leave the cluster mostly idle.
But we'll see, I guess. The insides of single computers are starting to look increasingly cluster-like, and low-latency switching is maybe a little less exotic and expensive now because of SAN's (for some values of ``low'').
How are threads implemented: part one, experimental M:1 systems
Threads are a programming language idea, so it's wrong to talk about their implementation except as an entire stack extending all the way up to a high-level language. But, in general, we do anyway. We only care about implementing threads in C. We don't need to look into other languages like Perl, Java, Python, or Lisp, where threads mostly exist through adaptation layers over the POSIX threads for C API, sharing whatever limitations C threads have on that platform. (There might be a small CPU- and ABI-dependent piece in the not-C language. I'm not sure.)
Let's talk about some of the implementations of POSIX threads for C, on Unix.
The first reliable thread implementation for Unix was MIT pthreads (GNU Pth is now something like it). Pth exists entirely within the userspace context of a single process, with no help from the Unix kernel. No one uses Pth any more. Why? Not modifying the kernel is the cornerstone of the Pth architecture, and it's a decision that introduces unsolvable problems. Here are some problems with Pth which are severe enough to make it an inadequate foundation for a modern threaded platform:
A thread that blocks in a syscall blocks the whole process, so Pth replaces common blocking calls with wrappers like pth_read(3) and pth_write(3) for read(2) and write(2)---hopefully a few calls account for most of the blocking. Most syscalls are not wrapped.
Processes can also block in the VM subsystem, because of changes to the resident set or because of a read from an mmap(2)'ed file, and this will likewise block all threads in a Pth program. Pth could never help with this, while kernel-provided thread implementations should keep running other threads in the same process in that case.
Pth cannot preempt a running thread: a thread keeps the CPU until it voluntarily calls into the library with something like pth_select(3), pth_yield(3), or pth_mutex_acquire(3). Some userspace threads libraries other than Pth attempt preemption, but I don't know if it can be done efficiently. Pth doesn't try. Preemption is valuable in two ways. First, threads that do long, interruptible computations can be timesliced so they each get a little bit of work done. Sometimes this is really good, and when it's not good it's usually only a little bit bad (a little slower).
Second, it's necessary for expected real-time behavior. Suppose a user-interface thread is blocked in select(2) while a long-running computation thread uses the CPU. Then, one of the file descriptors in the select(2) has activity, unblocking the thread. A Pth program can't run the select(2) thread until the computing thread calls pth_yield(3) or some other Pth function, but a kernel thread implementation can immediately, possibly even mid-syscall or mid-instruction, take the computation thread (which has probably more than used up its timeslice by now) off the CPU and schedule the user-interface thread, thus drastically reducing the latency of the program's response to the user. Some other userspace thread implementations may try to do this using the POSIX asynchronous I/O API, which predates threads and is based on signals, but this async I/O trick, albeit a fantastic hack, doesn't cover all syscalls that can block, and seems not to have worked out very well in practice.
Modern programs are written expecting this feature will be there, so they basically never call pth_yield(3). This is something we need to do basic, expected things, like cancel a print job or switch browser tabs while a complicated page is rendering.
GNU Pth is important because, even after Linux and *BSD had kernel-provided thread implementations, those implementations were incomplete or outright broken for many years. Linux took a very long time to switch over to NPTL, its third-try kernel thread implementation (fourth-try? At least LinuxThreads and NGPT came first), which unlike the earlier ones actually aims to be mostly correct. Even after NPTL existed and people started bragging about it, it took a couple more years for it to make it onto release-engineered desktops and servers.
Likewise, BSD still has buggy kernel threads---as of 2008 this applies to at least:
If the OS keeps both presented CPU's within a single core busy, each of the two presented CPU's will run a little faster than half-speed. Suppose we have four runnable threads in our hypothetical four-core HyperThreaded system. The OS can't treat the eight presented CPU's as equivalent---it needs to understand the difference between cores and hyperthreads. Otherwise, it might schedule all four threads onto two cores and leave the other two cores idle.
This sort of thing does actually happen, so unless you take the time to test it or for some reason are certain it won't happen, you need to turn HyperThreading off for any multi-CPU Unix system.
Of course, the latter two problems aren't worse than Pth, but the first two are. The point is, Pth is finished, and kernel threads in free Unix still aren't! Linux I think is almost finished, and maybe FreeBSD/i386 but I'm not positive. NetBSD 5.0 will probably be finished.
How are threads implemented: part two, mature systems
Which Unixes definitely have a finished thread implementation as of 2008? Linux, and I think {Free,Open}BSD but on i386 only, were all mostly finished (watch out for HyperThreading, broken gdb's, missing libc functions, low performance on certain workloads) a couple of years after 2006. And any vendor Unix, like AIX, OSF/1, Irix, Solaris: they were all finished as of 1997. Let's have a look at what those latter guys are doing---or rather, what they did a long time ago.
There was once serious debate about how to make the most efficient thread architecture. Here are some of the old proposals:
This upset people because Pth is older, yet faster, and usually one feels we should be moving unambiguously forward. Also, Java has lots of threads, and Sun cares a lot about Java performance. Java's green threads, which are equivalent to Pth M:1 threads, often ran real Java programs faster than native 1:1 threads. In spite of its fundamental limitations, the M:1 model seemed practical for real work.
Windows NT uses 1:1 threads only. Linux's many kernel thread implementations---LinuxThreads, NGPT, NPTL, whatever else---are all 1:1 implementations.
If there are no runnable threads among the M (they're all sleeping), the userspace scheduler can call into the kernel to terminate an LWP, decrementing N. Likewise, the kernel tries to keep N as small as possible, but large enough to busy any idle CPU's. If one of the threads makes a syscall that blocks, like read(2) for example, the kernel may increase N---syscalls are managed by leaving the whole LWP blocked, so sleeping in a syscall uses up an LWP for spreading-across-CPU's purposes. The exact mechanism for increasing and decreasing N that Sun uses in their ultimate LWP implementation, scheduler activations, is based on an ``upcall,'' which is something like a BSD signal, and is not something I understand well. There's a mechanism for accomplishing the same job that I find simpler to understand, Masuda & Inohara's Unstable Threads, but no one has implemented that one in a real operating system.
Sun had threads earlier, but completed their mature M:N architecture, with scheduler activations, in 1997 with Solaris 2.6. An early version of OSF/1 (before it was called Tru64) also had scheduler activations, though I'm not sure exactly when. NetBSD implemented scheduler activations in the 1.6 formal release, but never got them working well. 1.6 was the first release of NetBSD with any kernel-supported threads, and from 1.6 through 4.0, M:N threads are the only kind available.
FreeBSD has user-choosable schedulers. There's a 1:1 scheduler and an M:N scheduler. I'm not sure which one was finished first. I believe the M:N scheduler is the default on i386, and the 1:1 scheduler is the default and the only one working on sparc64 (please correct me). There is debate about which one is faster. I think M:N is sometimes much faster and sometimes much slower than 1:1. They believe there is a lot of performance work to do on both schedulers.
Linux and Windows NT have never had M:N thread implementations.
Sun later offered an experimental 1:1 thread implementation. They worked hard on optimizing its performance and found it can match or beat their ``LWP'' M:N implementation, and it's much easier to keep bug-free. FreeBSD still has a choice of 1:1 or M:N (XXX---what are their names for the two choices, and how do you pick one?). I think FreeBSD will first fix the cases where their 1:1 scheduler is much slower than their M:N scheduler, so that it's at worst only a little slower. When that happens, sysadmins will manually switch to 1:1, then it'll become the default, then M:N will be discarded---exactly like Solaris, except that it's happening eleven years later, and is happening before they've delivered any performance-wise mature thread implementation.
NetBSD has given up on scheduler activations. They will rip it out and will not finish its SMP support. 1:1 threads are the only kind available in NetBSD-current, and I believe will be the only kind supported in NetBSD 5.0. Discontinuing support for something breaks binary compatibility, which they care about rather a lot. I believe their plan is to support thread-using NetBSD 1.6 - 4.0 binaries only if dynamically linked, through a replacement libpthread.so. However, I'm not sure if that's working yet on NetBSD-current.
In Solaris 10, the old LWP scheduler has been garbage-collected. 1:1 threads are mandatory.
After all this dust settled, OpenBSD eventually quietly delivered some kind of 1:1 threads, I think. I'm not sure of the details, or how well it works on not-i386.
And Linux, of course, delivered 1:1 threads long before the dust settled (but got it horribly wrong at least twice, and wasted mountains of IBM software developers' time on politically acceptable but doomed projects).
Summary
Solaris and other proprietary Unixes had bug-free threads a full decade before any free operating system. Linux was the first free operating system to have decent, reliable, performant POSIX threads, though they arrived only after a couple of heinously ugly false starts of a kind that never happened on proprietary Unixes.
1:1 thread architectures are the easiest to make mature/bug-free, complete, and decently performant on all workloads. Both M:1 and M:N architectures have been abandoned, industry-wide.