What's the difference between an L3 switch and a router?

The first thing that needs to be said on this point is: there isn't a difference. However, these devices aren't just interchangeable lego blocks with a few attributes like ``number and kind of ports'' and ``max forwarded pps.'' You may need to know more about what's going on inside them to avoid surprises.

All forwarding is switching. Routing is something else.

A router, in Cisco's terminology, performs two functions: switching and routing. I think this is confusing to Linux sysadmins or D-Link/Linksys/Netgear ``works-for-me'' SOHO firewall users who think that ``routing'' means forwarding packets at L3, and ``switching'' means forwarding packets at L2. Such a person might think a ``Linux router'' is a PeeCee with more than one Ethernet card, on which he's enabled IP forwarding and installed a few entries in the routing table with the route add ... command. This is the wrong term for us. Likewise, he might think a ``switch'' is a device with no user interface that he jacks into his network the same way he used to wire a hub. This might be right, but if he thinks that switching must be no more or less than the job this unmanaged switch does, again, that's wrong for us.

For us, forwarding and switching will be synonyms. In those examples, no routing is happening, only switching. Accepting a packet, consulting a table to decide where the packet goes next, and transmitting it there, is switching.

Protocols that decide what belongs in the table are routing protocols. RIP, OSPF, and IS-IS are examples of routing protocols, and the device that implements these protocols is doing routing.

The data plane and the control plane

Routers have a data plane and a control plane. So do switches. The control plane is a traditional computer running some operating system probably loaded from FLASH memory, and probably written mostly in C. It presents the device's user interface over something like telnet, ssh, or serial port, and it does most of the work aside from switching. In a router, the control plane implements RIP, OSPF, or IS-IS. In a switch, it implements STP and IGMP snooping. Control plane functions can't involve moving huge amounts of data around, and they also can't have latency requirements tighter than tens of milliseconds (even an expensive device like the Cisco 6500, for example when the control plane is implementing BFD, can't reliably respond in less than 100ms).

The data plane is much weirder. It contains many CPUs, but none of them wired like a traditional computer. They'll be running tiny programs written by joyless electrical engineers in disgusting pseudoassembler languages, stored in ad-hoc memory devices. The CPUs will probably be Harvard architecture because electrical engineers are comfortable with arbitrary rigidity and have cargo-cult design habits. They will call their programs ``microcode,'' but in general this label is unfair. For example most gigabit Ethernet MAC chips include one or two MIPS cores, the same instruction set used in the old SGI workstations, the Sony PlayStation 2, and the control planes of many routers and switches. And if you look at the ``network module'' of a low-end Cisco router, which is part of the data plane, you'll probably see a recognizable discrete CPU like a 68k, an 80186, an i960---something like that (XXX -- I should dig for what they actually use). The electrical engineers call it ``microcode'' because they're too dumb to get a compiler toolchain working, so they try to imply it would be inappropriate to want one.

There are other non-CPUish things on the data plane just as there are non-CPUish chips in an ordinary computer or in a control plane, but they're wired abnormally around the flow of packets through the router or switch. There will be FPGA's (look for chips stamped Xilinx, Lattice, Altera) and custom ASIC's with the router or switch vendor's name stamped right on them. Many of these FPGA's and ASIC's do have some strange task, yes, but you'll find each chip often spends a quarter of its silicon real estate on a RISC-ish CPU core like ARM or MIPS. CPU-ish and non-CPUish things are often in the same chip, but instead of one big pool of memory there are smaller pools each with different, specific purposes.

The custom logic and so-called ``microcode'' of these many CPUs on the data plane act in concert, accepting simple direction from the control plane, to perform switching. Because the CPUs run small programs with tight loops, they probably don't do much interrupt handling or context switching, maybe none at all. This keeps the data plane's latency extremely low, on the order of the transmission time of a single packet.

| What's the device called? | What's the data plane doing? | What's the control plane doing? |
| --- | --- | --- |
| Unix PeeCee router | doesn't exist | L3 switching; sometimes L2 switching and STP; probably not doing routing (but can, Quagga/XORP) |
| Linksys/D-Link/Netgear junk | L2 switching | L3 switching (doesn't do routing, nor even STP) |
| L2 switch | L2 switching | STP/IGMP, SNMP |
| L3 switch | L3 switching, L2 switching | routing, STP/IGMP, SNMP |
| router | L3 switching; sometimes L2 switching | routing, SNMP; sometimes STP |

If you run Quagga on your Unix PeeCee, the above table isn't correct for you. Quagga running in userspace is sort of like the control plane in a real router, and the kernel IP stack is sort of like the data plane. The routing socket API that Quagga uses to talk to the kernel is similar in purpose to the interface between the control plane and data plane in a real router. However, I think this analogy harms understanding more than it helps! The Unix PeeCee still doesn't have a data plane. These words are reserved for physical wires inside the device over which packets flow, and chips that implement functions. In a Unix PeeCee, there is a single plane---packets flow through the CPU, through the same wires and chips that are implementing OSPF and other control plane functions. Some real routers with separate planes also have, within the control plane, a boundary between kernel and userspace---this boundary is a feature of the real router's control plane that's generally none of your concern. The interface between the data and control planes is something physical, not a software API or system call vector. And, adding to the confusion, some real routers with data planes can forward packets on the control plane when the packets need special mangling. Cisco calls forwarding packets through the control plane ``process switching.'' On a Unix PeeCee, everything is process-switched.

Fancier devices can push more of their function into data planes. L2 switches---the simplest data plane---can bridge Ethernet frames out the right port based on destination MAC address, and I think the learning function, where the switch associates ports with destination MAC's by watching the source MAC's of frames it receives, also happens in the data plane.
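The learning and forwarding behavior described above can be sketched in a few lines. This is only a toy model under my own assumptions: the class name, port numbers, and shortened MAC strings are made up, there is no aging of table entries, and nothing here says how any particular vendor's silicon does it.

```python
# Toy model of an L2 learning switch. Ports and MAC strings are illustrative only.

class LearningSwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame should be transmitted on."""
        # Learning: the source MAC is reachable through the ingress port.
        self.mac_table[src_mac] = in_port
        # Forwarding: a known destination goes out one port, anything else floods.
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            return self.ports - {in_port}
        if out_port == in_port:
            return set()             # destination lives on the ingress segment
        return {out_port}


sw = LearningSwitch(ports=[1, 2, 3])
print(sw.receive(1, "aa:aa", "bb:bb"))   # bb:bb unknown, so flood to ports 2 and 3
print(sw.receive(2, "bb:bb", "aa:aa"))   # aa:aa was learned on port 1
print(sw.receive(1, "aa:aa", "bb:bb"))   # bb:bb is now known on port 2
```

Broadcast frames fall out of the same logic: the broadcast address never appears as a source, so it is never learned and always floods.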

L3 switches can pass IP traffic in the general case. The data plane has to read past the Ethernet frame header to the IP address and TCP/UDP port numbers. It has to rewrite the Ethernet frame's destination MAC address with that of the next-hop it's looked up in the ARP table, and replace the source MAC address with its own. It also has to decrement the L3 IP TTL (no more storms!). Because it rewrites the MAC addresses, it has to recalculate the CRC for the Ethernet FCS, and because it decrements the TTL, it has to recalculate the IP header checksum as well; an L2 data plane needs to do neither. It has to do all these things. But there are tricks. Neither the Cisco 6500 nor the Extreme i-series can forward the following types of packets in the data plane, which means they're punted up to the control plane.
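To make the TTL and checksum part of that per-packet work concrete, here is a minimal sketch of it on a bare 20-byte IPv4 header. The function names are mine, the Ethernet FCS and MAC rewrite are left out, and real hardware would more likely update the checksum incrementally than recompute it from scratch.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, then complemented."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def forward_rewrite(header: bytes) -> bytes:
    """Decrement the TTL (byte 8) and recompute the checksum (bytes 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("TTL expired; a real device would punt this packet")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                 # checksum field is zero while summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


# A sample UDP header from 192.168.0.1 to 192.168.0.199 with TTL 64.
original = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ipv4_checksum(original)))          # 0xb861
with_sum = original[:10] + b"\xb8\x61" + original[12:]
rewritten = forward_rewrite(with_sum)
print(rewritten[8], rewritten[10:12].hex())  # 63 b961
```

A receiver checks the header by summing it with the checksum field included; a valid header makes `ipv4_checksum` return 0.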

Special packets

Some of them are listed below, but there are a lot of ``special'' packets. On Cisco, type show mls rate-limit to see brief, incomprehensible names for them. There are lots of gotchas and corner cases. For example, consider the following sentence: ``Layer 2 rate limiters do not work in the truncated fabric-switching mode: that is, they do not work in a system that has a mix of fabric-enabled and traditional, non-fabric-enabled line cards or if the system has classic line cards and redundant Cisco Catalyst 6500 Series Supervisor Engine 720s.'' What does the sentence mean? I could look up the different switching modes of the 6500 and give you some more incomprehensible sentences to explain the first one, but the short answer is, we netadmins will probably never fully understand that sentence, and further understanding isn't helpful without experience mitigating DDoS's.

Anyway, here are some comprehensible ``special'' traffic types that both Cisco and Extreme switch on the control plane instead of the data plane.

In L2 and L3 switches, the control plane's connection to the data plane is Ethernet. Current L3 switches, the ones that have numbered slots and modules not the fixed-configuration ones, usually have a 1 - 4 Gbit/s connection to the data plane. The data plane itself will have 32 - 720Gbit/s capacity. My Extreme switches have 64Gbit/s, but unlike Cisco all Extreme switches seem to have a data plane capacity big enough to handle line rate traffic in and out of every port so we keep less careful track of this spec. It's not hard for an attacker to fill up the control plane's couple-gigabit Ethernet connection by sending one of the above kinds of special packet through your switch, which sort of defeats much of the point of buying a switch with a line-rate-on-every-port backplane. Cisco has a decent answer to this problem---they give you an evolving, rather arcane, toolbox for marking and policing traffic that flows through this control plane gigabit interface. (Cisco calls the control plane the MSFC. The PFC is part of the data plane, and the EOBC is a gigabit Ethernet interface between the PFC and the MSFC.) They focus in their paper on this toolbox because it requires your configuration, and because it's only gotten useful in their latest management modules. But the most important feature in Cisco's DDoS white paper, a prerequisite for all the others, is the one they mention briefly at the end: CEF. Extreme doesn't have a good answer to DDoS by packets on this list. They can't, because they don't have an equivalent to CEF.
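The general idea of policing punted traffic can be shown with a token bucket. To be clear, this is a generic textbook policer, not a description of how Cisco's rate limiters are built; the class name, the packets-per-second unit, and the numbers are all my assumptions.

```python
# Generic token-bucket policer for packets punted to the control plane.
# Not vendor code: the rate, burst, and time source are illustrative assumptions.

class TokenBucket:
    def __init__(self, rate_pps, burst):
        self.rate = rate_pps          # tokens (packets) added per second
        self.burst = burst            # bucket depth
        self.tokens = float(burst)
        self.last = 0.0               # timestamp of the previous packet

    def allow(self, now):
        """Return True if a packet arriving at time `now` may be punted."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # over the limit: drop in the data plane


policer = TokenBucket(rate_pps=10, burst=5)
print([policer.allow(0.0) for _ in range(8)])   # a burst of 8: 5 pass, 3 are dropped
print(policer.allow(0.5))                       # half a second later the bucket has refilled
```

The point is that the drop decision is cheap enough to make before the packet ever crosses the narrow link to the control plane.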

Flow forwarding vs. longest-prefix

Extreme is a flow-forwarding platform, so there's an additional type of traffic that an Extreme data plane will punt, one which needs a bit of explaining. First, let's ask, how is L3 traffic switched, conceptually, with the most obvious on-paper-only implementation? The behavior is driven by a lookup table. The key is an IP address and a prefix length (subnet mask). Big subnets are allowed to contain smaller subnets, so two rows in this table might together mask overlapping address regions, and the correct entry in the table isn't just any prefix that matches the destination address---it's the longest matching prefix and only the longest matching prefix. That's the table's lookup key.

The value the table returns is an output interface and an optional next-hop IP address. If the table includes the optional next-hop value, the right way to forward the packet is, look up the next-hop with ARP, set the destination MAC to the one given in the ARP reply, and then dump the frame onto the wire indicated by the output interface field. If the next-hop is missing, instead look up the destination IP with ARP, and use that ARP reply to overwrite the destination MAC address.

So far we've described two tables:

routing table: {destination IP, prefix length} -> {output interface, next-hop IP}

ARP cache: {wire IP, output interface} -> {destination MAC address}

show ip cef only shows the first table, but in practice, I suspect Cisco combines the two tables. Next-hops that don't have destination MAC's from ARP yet are marked with the CEF glean adjacency and sent to the control plane for buffering. (another nasty DDoS vector.)
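As a sketch, the two-table lookup reads like this. The routes, interface names, and MAC addresses are invented for illustration, and the linear scan stands in for whatever longest-match structure a real data plane uses.

```python
import ipaddress

def net(s):
    return ipaddress.ip_network(s)

def ip(s):
    return ipaddress.ip_address(s)

# routing table: (prefix, output interface, next-hop IP or None if directly connected)
ROUTING_TABLE = [
    (net("0.0.0.0/0"),       "eth0", ip("192.0.2.1")),
    (net("192.0.2.0/24"),    "eth0", None),
    (net("198.51.100.0/24"), "eth1", None),
    (net("10.0.0.0/8"),      "eth1", ip("198.51.100.2")),
    (net("10.1.0.0/16"),     "eth0", ip("192.0.2.9")),
]

# ARP cache: (wire IP, output interface) -> destination MAC
ARP_CACHE = {
    (ip("192.0.2.1"),    "eth0"): "00:00:5e:00:53:01",
    (ip("192.0.2.9"),    "eth0"): "00:00:5e:00:53:09",
    (ip("192.0.2.50"),   "eth0"): "00:00:5e:00:53:50",
    (ip("198.51.100.2"), "eth1"): "00:00:5e:00:53:02",
}

def l3_lookup(dst):
    """Return (output interface, destination MAC or None if ARP is unresolved)."""
    dst = ip(dst)
    matches = [row for row in ROUTING_TABLE if dst in row[0]]
    if not matches:
        return None
    # Only the longest matching prefix counts.
    _, out_if, next_hop = max(matches, key=lambda row: row[0].prefixlen)
    # With a next-hop, ARP for the next-hop; otherwise ARP for the destination itself.
    wire_ip = next_hop if next_hop is not None else dst
    return out_if, ARP_CACHE.get((wire_ip, out_if))


print(l3_lookup("10.1.2.3"))       # /16 beats /8 and the default route
print(l3_lookup("10.2.2.3"))       # only /8 and the default match
print(l3_lookup("198.51.100.77"))  # connected, but no ARP entry yet
```

The last lookup returns a MAC of `None`, which is the situation where a packet would have to wait on ARP resolution.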

These two tables are enough to implement L3 switching, but not what we call routing on this page (OSPF, IS-IS, BGP). Unlike the way we usually think of Unix routers that I've just described, real routers contain more than one copy of a routing table. The RIB is a table of L3 destinations and next-hops kept inside the control plane. The RIB can contain more than one tuple for a single destination, and when this happens some of the routes to a given destination may be inactive, because they don't have the best metric. On Cisco, show ip route pretty-prints the RIB, and punctuation characters mark the active and inactive routes. The FIB, copied from the RIB, contains only active routes, and contains no metrics.
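A minimal sketch of deriving a FIB from a RIB, assuming the only selection rule is ``lowest metric wins.'' Real routers compare more than one attribute (route source, administrative preference, and so on), and the prefixes, next-hops, and metrics below are made up. Equal-metric routes are all kept, which is my assumption about what an implementation that supports multipath would do.

```python
# RIB rows: (prefix, next-hop, metric). All values are hypothetical.
RIB = [
    ("10.0.0.0/8",    "192.0.2.1", 20),   # inactive: worse metric
    ("10.0.0.0/8",    "192.0.2.2", 10),
    ("10.0.0.0/8",    "192.0.2.3", 10),
    ("172.16.0.0/12", "192.0.2.1", 5),
]

def build_fib(rib):
    """Keep only the best-metric routes per prefix and drop the metrics."""
    best = {}
    for prefix, _, metric in rib:
        best[prefix] = min(metric, best.get(prefix, metric))
    fib = {}
    for prefix, next_hop, metric in rib:
        if metric == best[prefix]:
            fib.setdefault(prefix, []).append(next_hop)
    return fib


for prefix, next_hops in build_fib(RIB).items():
    print(prefix, next_hops)
```

The route via 192.0.2.1 for 10.0.0.0/8 stays in the RIB as an inactive alternative but never reaches the FIB.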

The FIB might still contain multiple routes to a single destination, if the data plane is intended to load-balance traffic across two interfaces or next-hops. The most accepted way to do load-balancing is to key the FIB lookup off a hash of the TCP and UDP port numbers---this guarantees a given TCP circuit will follow the same path, unless the control plane's OSPF/IS-IS/BGP routing protocols decide to redirect it, which won't happen often. This ensures the packets belonging to that TCP circuit arrive in order (which isn't important, but some people think it is, and other people design very stupid not-Internet-ready UDP protocols where in-order delivery actually is important). The RADIX multipath scheme can also reduce flows' RTT variance (``jitter''), which is nice for TCP and critical for VoIP.
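Here is roughly what hash-keyed next-hop selection looks like. The choice of CRC32, the inclusion of addresses and protocol alongside the port numbers, and the function name are my assumptions; hardware uses whatever hash its designers picked.

```python
import zlib

def pick_next_hop(next_hops, src_ip, dst_ip, proto, src_port, dst_port):
    """Choose one of several equal next-hops from a hash of the flow's headers."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


hops = ["192.0.2.2", "192.0.2.3"]
flow = ("10.0.0.1", "198.51.100.7", "tcp", 40000, 80)

# Every packet of one flow hashes to the same next-hop, so the flow stays in order.
choices = {pick_next_hop(hops, *flow) for _ in range(100)}
print(len(choices))   # 1
```

Different flows land on different next-hops only statistically, so a handful of very large flows can still load the links unevenly.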

On Cisco, show {ip,ipv6} cef is one way to pretty-print something like the FIB. The FIB more or less belongs in the data plane, but as we'll see, this is arguable.

The two tables I described above aren't enough to implement what people now expect of a router (or even an L3 switch). Policy routing demands putting more values into the lookup key of the FIB such as: source IP address, IP TOS, packet size, maybe TCP or UDP port numbers. Access lists mean sometimes the output interface will be ``discard this packet.'' Unless the access list or policy routing rules can be translated (``compiled'' or ``optimized'') somehow, we may need to add some kind of sequence number in the lookup key because policy routing and access lists both have a first-match character, which means modifying the longest-match lookup rule. We may already need strange kinds of specially-constructed silicon memory to store the FIB in a way that longest-match lookups are O(1). Achieving O(1) lookups is even harder with all this policy routing and access list stuff. In general it's not acceptable to impose a rule that access lists can't be ``too long,'' so O(1) lookup is a requirement.
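The first-match character of access lists is easy to see in a sketch. The rule format and addresses are invented, and I'm assuming an implicit deny at the end of the list, as in Cisco-style access lists.

```python
import ipaddress

# Rules are (action, source prefix, destination port or None for any port).
# The list is evaluated top to bottom and the first matching rule wins.
ACL = [
    ("permit", "10.1.1.0/24", 80),
    ("deny",   "10.1.0.0/16", None),
    ("permit", "0.0.0.0/0",   None),
]

def acl_action(acl, src_ip, dst_port):
    src = ipaddress.ip_address(src_ip)
    for action, prefix, port in acl:
        if src in ipaddress.ip_network(prefix) and port in (None, dst_port):
            return action
    return "deny"                      # assumed implicit deny


print(acl_action(ACL, "10.1.1.5", 80))                   # permit (rule 1)
print(acl_action(ACL, "10.1.1.5", 22))                   # deny   (rule 2)
print(acl_action(list(reversed(ACL)), "10.1.1.5", 22))   # permit: same rules, different order
```

Because reordering the same rules changes the answer, the rule's position has to be part of what the hardware matches on, which is exactly what a pure longest-match table doesn't give you.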

The Riverstone 15000 Edge Router was one of the first L3 switches. It doesn't have the word ``switch'' in its name, but with the benefit of hindsight we can see, that's what it was. Riverstone's idea was flow forwarding---a sneaky way to get these policy routing and access list features while using much less fancy FIB-storage memory chips, less fancy even than chips capable of longest-match.

A ``flow'' is now an accepted, first-class entity in the L3 world since the ``flow label'' field got put into IPv6. The IPv6 flow label is at layer 3, not layer 4 or 5 or whatever silly OSI-layer-model-memorizers will tell you. It's missing in IPv4, but in that case it can be approximated by the header hash of protocol and port numbers I spoke of earlier that's useful for multipath, multiple FIB entries for a single destination, and for implementing WFQ output queues that punish poorly-behaved flows that aren't using good congestion-backoff algorithms.

Well, whatever... I don't mean to ramble. The point is, the flow idea is already used for more than just flow switching. A flow is supposed to approximate a single conversation, like an established TCP circuit, or a UDP streaming source sending a string of packets with the same UDP source and destination ports and no more than a few seconds between packets. Riverstone's idea is to optimize their data plane for long flows. They replace traditional FIB lookups with a flow lookup in a table that has one entry per flow. The first packet of each flow gets switched by the control plane. The traditional FIB then stays in the control plane, or it can be a nonexistent thing derived from the RIB on demand. The FIB in the data plane is replaced with a route cache or a flow cache. I think Extreme calls it an IPFDB.

This has some cheap, convenient consequences. The data plane doesn't have to implement access lists because only permitted combinations of source, destination, protocol, and port number will be entered into the route cache. The data plane doesn't need to implement load balancing, either, because the control plane can implement best-common-practice RADIX load balancing by giving different next-hops to the tens, hundreds, thousands of route cache children that may spawn from a pair of RIB entries calling for a load-balanced route.

As far as I can tell, the Extreme IPFDB route cache always contains fully-specified entries with:

{input interface, source IP, destination IP, L4 protocol (which can be only TCP or UDP), L4 source port, L4 destination port} -> {output interface, L2 destination MAC, boolean: rewrite or don't-rewrite the source MAC}

(With flow forwarding in general, sometimes input interface is part of the table, and other times there is a separate table for each input interface.)
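Put together, flow forwarding with a first-packet punt looks something like the sketch below. It is a guess at the general shape, not Extreme's or Riverstone's implementation: the class, the policy function, and the addresses are hypothetical, and the cache here never expires or evicts anything.

```python
class FlowSwitch:
    def __init__(self, slow_path):
        self.slow_path = slow_path    # control plane: full RIB/ACL/policy decision
        self.flow_cache = {}          # exact-match flow key -> forwarding action
        self.punts = 0                # packets handed to the control plane

    def forward(self, in_if, src, dst, proto, sport, dport):
        key = (in_if, src, dst, proto, sport, dport)
        action = self.flow_cache.get(key)
        if action is None:
            self.punts += 1
            action = self.slow_path(*key)
            if action is not None:    # only permitted flows are cached
                self.flow_cache[key] = action
        return action


def control_plane(in_if, src, dst, proto, sport, dport):
    """Hypothetical policy: drop telnet, send everything else out eth1."""
    if proto == "tcp" and dport == 23:
        return None
    # (output interface, L2 destination MAC, rewrite source MAC?)
    return ("eth1", "00:00:5e:00:53:02", True)


fs = FlowSwitch(control_plane)
for _ in range(5):
    fs.forward("eth0", "10.0.0.1", "10.0.1.1", "tcp", 40000, 80)
print(fs.punts)   # 1: only the first packet of the flow reached the control plane
```

Note what happens to denied traffic in this sketch: it never gets a cache entry, so every denied packet is punted again.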

Extreme's fully-populated-route-cache requirement (which might be a fantasy of mine, not an actual requirement, but...) means, they forward anything that's not TCP or UDP through the control plane. That includes ICMP pings passing through the switch, GRE tunnels, IPsec---any IP protocol besides TCP or UDP. I find the idea of putting something like that on the public Internet a little chilling! I don't actually have a fast enough Internet connection for it to matter. I also don't enable ipforwarding on any Extreme VLAN that's in front of my firewall. The Extreme stuff works well for me, but this flow forwarding idea has some drawbacks.

There's another problem. Flow-forwarding switches aren't good enough for certain traffic mixes. Suppose the traffic mix is in front of a load balancer (which must be a lot more expensive than your flow-based L3 switch!), a few gigabits worth of web users reloading slashdot.org or search.yahoo.com. Here is the first problem: the flows are short. They're probably only 4 - 10 packets for each connection, and in this case one connection is two flows. The route cache on the data plane doesn't have infinite size---it will fill up, or ``thrash.'' And the first packets of these flows will overwhelm the control plane.
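A small simulation shows why the traffic mix matters. The cache sizes, flow counts, and the LRU eviction policy are assumptions of mine chosen to make the effect obvious; a real route cache may evict differently.

```python
from collections import OrderedDict

def count_punts(packets, cache_size):
    """Count first-packet punts for a packet sequence through an LRU flow cache."""
    cache = OrderedDict()
    punts = 0
    for flow in packets:
        if flow in cache:
            cache.move_to_end(flow)          # cache hit: refresh the entry
        else:
            punts += 1                       # miss: control plane handles it
            cache[flow] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict the least recently used flow
    return punts


# 10 long flows of 1000 packets each: almost nothing reaches the control plane.
long_flows = [f for f in range(10) for _ in range(1000)]
print(count_punts(long_flows, cache_size=100))    # 10

# 200 concurrent flows, 5 packets each, interleaved round-robin.
interleaved = [f for _ in range(5) for f in range(200)]
print(count_punts(interleaved, cache_size=200))   # 200: one punt per flow
print(count_punts(interleaved, cache_size=100))   # 1000: the cache thrashes, every packet punts
```

In the last case the cache is only half the size of the working set, and the hit rate doesn't drop to half; it drops to zero.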

You might say, ``well what if we don't store so much in the route cache. What if we don't need to implement policy routing and access lists? We just want to do BGP on the switch, and any other fancy stuff we need to do will be the load balancer's job.'' Riverstone let you control how many fields go into the route cache lookup key. ip set forwarding-mode destination-based implements this what-if, but the Riverstone route cache still doesn't contain longest-match destinations. It contains IPv4 /32's, and the control plane still has to put them there. If your pool of webservers is sending back TCP SYNACK's to web browsers all over the Internet, the control plane will still be overwhelmed by first-packets, and the data plane route cache will still overflow with cached entries. It's not solvable this way.

The answer is to pay for a device that has a fancier longest-match FIB in its data plane---Cisco 6500 with a modern supervisor, instead of Extreme Alpine/i-series or Riverstone 15000. Extreme even sells this bizarre longest-match module for the BlackDiamond, but it still doesn't seem able to switch the backplane's full bandwidth in longest-match mode.

Even a longest-match FIB will still have a limited size, and this matters. It's actually worse in a certain way. For example, for the Cisco 6500, to run a default-free full BGP table, you have to buy the PFC3BXL instead of the PFC3B. If you are not running a full BGP table, you don't have to buy the XL version of the PFC, though they still tried to sell us one. Now, suppose the traffic mix is made of longer flows, big data transfers, instead of reloads of tiny web pages, so Extreme's flow-switching is fine. But suppose this time the L3 switch needs to run a full BGP table from two or more different ISP's. In this case, the control plane needs to have enough DRAM to keep a full BGP table in the RIB, but the data plane's route cache sizing only cares how quickly flows are brought up and torn down, not a bit about how big the RIB is. Extreme flow-switching has another cost advantage here. On Cisco, not only does the MSFC still need enough DRAM for a full-feed RIB, but one also needs the XL version of the PFC to store the >200k rows of longest-match FIB in the data plane.


Finally, after all this background, we can get to the difference between an L3 switch and a router. Here it is: a router can stuff packets into an IP-IP or a GRE tunnel, and forward them on their way, entirely on the data plane. An L3 switch can't.

Some routers will do IPsec on the data plane, but only the small ones. (discussed below)

In general, the data planes of routers are slower and more flexible than those of L3 switches. However, I think this is less true as you pay more money for the device---as we've discussed already, the most expensive L3 switches have longest-match FIB's instead of flow-forwarding (Extreme) or an optional exact-match mode (Riverstone). Likewise, the most expensive routers can switch small packets at line rate, but have lots of quirky limitations about which features one can use with which line cards.

| Unix PeeCee router | L3 switch | Old-style traditional router |
| --- | --- | --- |
| has no data plane | has data plane, but might do flow-caching | always has a longest-prefix FIB in the data plane |
| can have about four gigabit Ethernet interfaces | a few hundred gigabit Ethernet interfaces are possible | The ``old'' ones I'm talking about here would never have more than five or ten gigabit interfaces. |
| can forward about a million packets per second. doesn't matter if they're big or small packets. | relatively unexotic configurations can forward line-rate packets of any size in and out every interface | often unable to forward tiny packets at line rate. |
| QoS is horribly unreliable, and there is a large amount of unpreventable jitter you have to put up with. In short, QoS, if it works at all, is only helpful on <10Mbit/s links. However, extremely complicated and IMHO useful link-sharing policies are implemented in Linux and BSD, such as HFSC. | QoS is based on the DiffServ model: marking and policing is a separate action from scheduling. While complicated marker/policers are possible, only the most primitive scheduling is available---strict priority scheduling with RED. | There are some class-based schedulers, but I don't know of an HFSC implementation. |
| Long latency. Lots of jitter. crap. | low latency, low jitter | low latency, low jitter |
| statistics collection not reliable. counters poorly-marked and full of lies. | reliable statistics, per-port and sFlow/NetFlow, collected by data plane | reliable statistics, per-port and sFlow/NetFlow, collected by data plane |
| mostly has no silicon problems with BGP full feed. Usually the FIB can fit inside the CPU's L2 cache. | flow-forwarding platforms like Extreme don't care if they have a BGP full feed or a small routing table. Longest-prefix platforms like C6500 need the PFC..XL cards to forward packets with a full BGP feed. | no one ever discusses the CEF table capacity of a line card in a low-end router. I'm not sure why this isn't a problem, but line cards manufactured when the BGP default-free zone was 1/10th its current size are fully expected to do CEF forwarding with a full BGP feed, and to give the same performance they did when they were new. I need to understand the data planes of low-end routers a little better to explain this. It may also be untrue---I'd like to know if there are any full-feed problems on Juniper. Does Juniper even make ``low-end'' routers? |
| IPv6 was a software upgrade. Everything is always a software upgrade. | extremely obstinate about supporting IPv6, and when supported, full of gotchas like multicast v6. | IPv6 was a software upgrade, and there's good support in the data plane. |
| Does tunneling. | no tunneling, or does tunneling on the control plane. none, never. | Always does tunneling, and does it on the data plane. |
| could do MPLS if anyone wrote software that complicated for a PeeCee, but no one does | sometimes you can't do MPLS on an ordinary switchport. Only special line cards of higher cost and lower port density can face the MPLS core, but all platforms offer some way to architect a network that depends on line-rate MPLS. It is part of the vendors' vision about who will buy these switches. | I think they usually do MPLS just fine on the data plane, but I don't know how to configure MPLS yet, so I'm not positive. |
| I think they will do about 100Mbit/s of IPsec, but I'm not sure. | Usually no IPsec at all. On fancy software like Cisco's, they will give you control plane IPsec which is intended only for protection of traffic to or from the control plane, like telnet/ssh, SNMP, or the one routing protocol that needs IPsec: OSPFv3. There's an extremely weird IPsec module for the 6500 that will do one or two gigabits of IPsec (on a switch with a 32Gbit/s - 720Gbit/s backplane :/ ), but in addition to being disgustingly slow, it's a strange, clumsy thing to configure. | Seamless integration of IPsec. They can do IPsec on the data plane, but sometimes you have to pay extra for the IPsec chip. Low-end routers' IPsec capacity seems to be in the neighborhood of 1/2 - 1/4 of their total forwarding capacity. High-end routers are like the L3 switch for IPsec and can't do any meaningful amount of it. |

The Future

IPsec is going into Ethernet MAC chips. No, really. It's going to happen. We're going to get IPv6 just like they said, and just like they always said IPsec is going into Ethernet MAC's. It's going to happen ten or fifteen years later than the envisioners expected, but it'll happen---gigabit MAC's will do line-rate AES counter-mode just like wireless MAC's do line-rate WPA. It'll be a disaster because the clueless chip companies will implement IPv4 IPsec only, and even though IPsec is supposedly a ``mandatory'' part of IPv6, all the switch vendors will be doing IPv6 IPsec on the control plane or not at all. Meanwhile they'll be able to do v4 IPsec at the full backplane bandwidth of their chassis, so they'll sell licenses for ``number of sessions'' or encourage needlessly short timers for ``extra security'' and make IKE accelerators.

TCP is going into the load balancer. Clusters are no longer exotic because of VM farms and big web sites, and are no longer impractical now that buyers have given up asking for cluster-shared memory, but they need certain things. Some need a SAN, others need to handle the largest number of web hits per dollar and per watt, and others need to do the largest amount of SSL negotiations per dollar and watt. The big manufacturers of ``blade'' systems will respond to this by partnering with some second-rate SAN or networking company to allow Fabric Management Modules in their blade chassis and offer some non-Ethernet fabric like FC or IB among their blades and between cabinets at a deep discount.

There will be read-only filesystems for sale from startups designed to work with these single-vendor fabric-enabled blade clusters. These filesystems will be implemented entirely as a userland shared library that sits between Apache and iSCSI, like filesystems in VM/CMS, but they'll use a chunk of SYSV shared memory as buffer cache. The filesystems will be buggy and expensive and make fantastic privilege escalation vectors, except that they'll be so complicated and costly that hackers will never figure out how they work.

The blade fabric system will double your web throughput, but only if you buy from them the vendor-locked-in extremely expensive and annoyingly quirky head end module. It replaces your load balancer with something you like much, much less, and it costs only slightly less than twice as many CPU's and uses only slightly less power, but it saves you $50,000 and $4,000/mo, so you don't really have a choice do you? It will convert <fancy non-Ethernet SCTP fabric> to Ethernet and TCP.

Yes, it's TCP on the dataplane, the true ``L4 switch'' for anyone crazy enough to want it. It'll take care of all the TCP windows and timers and stuff in hardware and squeeze twice the number of slowass bittorrent-overloaded lossy DSL user web hits out of the same crappy Intel CPU farm. Pie charts, histograms, success stories from gullible medium-sized banks will show this. Horrible bugs will be discovered in its TCP implementation. Hackers will be able to take down whole web sites until new revisions of physical line cards are produced.

There will be TCP specification noncompliance, too. Windows, Mac OS X, and Linux (and three years later, Solaris) will all release software updates to work around misimplementations of TCP in a certain popular load balancer's data plane that cause poor performance over congested or high-speed links. BSD will not implement the workaround, and will always be oddly slow at accessing myspace, but we'll be stuck with these workarounds forever. The buggy cards will disappear after four years. The development of TCP will be retarded by about seven years by the imperative of keeping the workarounds forever, and by extensive bickering about them.

Yes, we will get IPv6, but we will be stuck with 1500-byte packets on the Internet forever. Any packet larger or smaller than 1500 bytes will exercise strange obscure bugs all over the place so no one will use them on the Internet, but every AS will secretly have, use, and depend on 9kByte or larger packets internally. Netadmins will have dick-size contests about whose AS can pass the largest packet end-to-end, but they will never sell any link to any customer for less than $5,000/mo that can transfer packets larger than 1.5kByte. The largest packet will be 64kBytes and will be quietly implemented throughout Sweden.

A quiet well-considered voice will beg for packets with at least 8kByte of payload and some L4 mechanism for splitting streams into receiver-directed aligned mmu-page-sized chunks. Someone will scream that it doesn't work with multicast, and distract everyone from finishing the proposal for eight months. RFC's for doing this will finally be published, discussed, perfected, and ignored. However the TCP-to-SCTP-converting load balancer mentioned earlier will learn to do this job, and through awkward proprietary tricks convert 1496-byte-MTU TCP streams into neatly page-aligned 8kByte chunkified SCTP streams that fancy InfiniBand cards can deliver to and from Apache without a single syscall. There will be a knob where you can choose if you want smaller packets padded to 8kByte or not, and this knob will be stupid and useless and everyone will unknowingly have it set wrongly.

Miles Nordin <carton@Ivy.NET>
Last update (UTC timezone): $Id: switch-l3intro.html,v 1.2 2008/03/16 19:00:13 carton Exp $