route add ...
command. This is the wrong term
for us. Likewise, he might think a ``switch'' is a device with no
user interface that he jacks into his network the same way he used to
wire a hub. This might be right, but if he thinks that
switching must be no more or less than the job this unmanaged
switch does, again, that's wrong for us. For us, forwarding and switching will be synonyms. In those examples, no routing is happening, only switching. Accepting a packet, consulting a table to decide where the packet goes next, and transmitting it there, is switching.
Protocols that decide what belongs in the table are routing protocols. RIP, OSPF, and IS-IS are examples of routing protocols, and the device that implements these protocols is doing routing.
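To make the vocabulary concrete, here's a toy Python sketch with made-up table entries (not any vendor's code): the switching job is the lookup-and-transmit step, and a routing protocol's only job is to decide what goes into the table.

```python
# Toy illustration of the vocabulary: switching = consult a table and transmit.
# Routing = whatever protocol (RIP, OSPF, IS-IS) fills that table in.
forwarding_table = {"10.0.1.5": "eth1", "192.0.2.9": "eth2"}   # made-up entries

def transmit(port: str, payload: bytes) -> None:
    print(f"sending {len(payload)} bytes out {port}")

def switch_packet(dst_ip: str, payload: bytes) -> None:
    out_port = forwarding_table.get(dst_ip)   # consult a table to decide where it goes next
    if out_port is None:
        return                                # no entry: drop it (or punt it somewhere smarter)
    transmit(out_port, payload)               # transmit it there -- that's switching

# A routing protocol never touches packets; it only edits forwarding_table.
```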
The data plane is much weirder. It contains many CPU's, but none of them wired like a traditional computer. They'll be running tiny programs written by joyless electrical engineers in disgusting pseudoassembler languages, stored in ad-hoc memory devices. The CPUs will probably be Harvard architecture because electrical engineers are comfortable with arbitrary rigidity and have cargo-cult design habits. They will call their programs ``microcode,'' but in general this label is unfair. For example, most gigabit Ethernet MAC chips include one or two MIPS cores, the same instruction set used in the old SGI workstations, the Sony Playstation 2, and the control planes of many routers and switches. And if you look at the ``network module'' of a low-end Cisco router, which is part of the data plane, you'll probably see a recognizable discrete CPU like a 68k, an 80186, an i960---something like that (XXX -- I should dig for what they actually use). The electrical engineers call it ``microcode'' because they're too dumb to get a compiler toolchain working, so they try to imply it would be inappropriate to want one.
There are other non-CPUish things on the data plane just as there are non-CPUish chips in an ordinary computer or in a control plane, but they're wired abnormally around the flow of packets through the router or switch. There will be FPGA's (look for chips stamped Xilinx, Lattice, Altera) and custom ASIC's with the router or switch vendor's name stamped right on them. Many of these FPGA's and ASIC's do have some strange task, yes, but you'll find each chip often spends a quarter of its silicon real estate on a RISC-ish CPU core like ARM or MIPS. CPU-ish and non-CPUish things are often in the same chip, but instead of one big pool of memory there are smaller pools each with different, specific purposes.
The custom logic and so-called ``microcode'' of these many CPUs on the data plane act in concert, accepting simple direction from the control plane, to perform switching. Because the CPUs run small programs with tight loops, they probably don't do much interrupt handling or context switching, maybe none at all. This keeps the data plane's latency extremely low, on the order of the transmission time of a single packet.
| What's the device called? | What's the data plane doing? | What's the control plane doing? |
|---|---|---|
| Unix PeeCee router | doesn't exist | L3 switching, sometimes L2 switching and STP; probably not doing routing (but can, with Quagga/XORP) |
| Linksys/D-Link/Netgear junk | L2 switching | L3 switching (doesn't do routing, nor even STP) |
| L2 switch | L2 switching | STP/IGMP, SNMP |
| L3 switch | L3 switching, L2 switching | routing, STP/IGMP, SNMP |
| router | L3 switching, sometimes L2 switching | routing, SNMP, sometimes STP |
If you run Quagga on your Unix PeeCee, the above table isn't correct for you. Quagga running in userspace is sort of like the control plane in a real router, and the kernel IP stack is sort of like the data plane. The routing socket API that Quagga uses to talk to the kernel is similar in purpose to the interface between the control plane and data plane in a real router. However, I think this analogy harms understanding more than it helps! The Unix PeeCee still doesn't have a data plane. These words are reserved for physical wires inside the device over which packets flow, and chips that implement functions. In a Unix PeeCee, there is a single plane---packets flow through the CPU, through the same wires and chips that are implementing OSPF and other control plane functions. Some real routers with separate planes also have, within the control plane, a boundary between kernel and userspace---this boundary is a feature of the real router's control plane that's generally none of your concern. The interface between the data and control planes is something physical, not a software API or system call vector. And, adding to the confusion, some real routers with data planes can forward packets on the control plane when the packets need special mangling. Cisco calls forwarding packets through the control plane ``process switching.'' On a Unix PeeCee, everything is process-switched.
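For what it's worth, here's what that kernel-API ``interface'' looks like from userspace, sketched in Python by shelling out to iproute2 rather than speaking the routing socket or netlink directly; the prefix and next-hop are hypothetical.

```python
import subprocess

def install_route(prefix: str, next_hop: str) -> None:
    """A userspace 'control plane' process handing a route to the kernel's
    forwarding code. Quagga does the equivalent through the routing socket
    or netlink instead of running a command."""
    subprocess.run(["ip", "route", "replace", prefix, "via", next_hop], check=True)

# install_route("203.0.113.0/24", "192.0.2.1")   # hypothetical prefix and next-hop
```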
Fancier devices can push more of their function into data planes. L2 switches---the simplest data plane---can bridge Ethernet frames out the right port based on destination MAC address, and I think the learning function, where the switch associates ports with destination MAC's by watching the source MAC's of frames it receives, also happens in the data plane.
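The learning function is simple enough to sketch in a few lines of Python; this is a toy model of the logic only, not of how any real switch chip stores or ages its table.

```python
from typing import Dict

mac_table: Dict[str, int] = {}        # MAC address -> port it was last seen on

def send(port: int, frame: bytes) -> None:
    print(f"out port {port}: {len(frame)} bytes")

def handle_frame(in_port: int, src_mac: str, dst_mac: str,
                 frame: bytes, all_ports: range) -> None:
    mac_table[src_mac] = in_port              # learning: remember which port src_mac lives behind
    out_port = mac_table.get(dst_mac)
    if out_port is None:
        for p in all_ports:                   # unknown destination: flood everywhere but in_port
            if p != in_port:
                send(p, frame)
    elif out_port != in_port:
        send(out_port, frame)                 # known destination: bridge it out one port
```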
L3 switches can pass IP traffic in the general case. The data plane has to read past the Ethernet frame header to the IP address and TCP/UDP port numbers. It has to rewrite the Ethernet frame's destination MAC address with that of the next-hop it's looked up in the ARP table, and replace the source MAC address with its own. Because of this rewriting, it has to recalculate the CRC for the Ethernet FCS, which L2 data planes don't need to do. It also has to decrement the L3 IP TTL (no more storms!), and because it decrements the TTL it has to recalculate and rewrite the IP header checksum, which L2 switches don't have to do either. It has to do all these things.
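Here's roughly what the TTL-and-checksum part of that rewrite looks like, sketched in Python. A real data plane does this in custom silicon or ``microcode,'' and the Ethernet FCS is recomputed by the MAC hardware as the frame goes out, so only the IPv4 header math is shown.

```python
import struct

def ipv4_header_checksum(header: bytes) -> int:
    """Standard ones'-complement sum over the header, with the checksum field zeroed."""
    total = 0
    for i in range(0, len(header) - 1, 2):
        total += (header[i] << 8) | header[i + 1]
    if len(header) % 2:
        total += header[-1] << 8
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def decrement_ttl(ip_header: bytearray) -> bytearray:
    """What an L3 data plane must do to every forwarded packet (L2 switches don't)."""
    if ip_header[8] <= 1:                     # TTL is byte 8 of the IPv4 header
        raise ValueError("TTL expired: punt to control plane for ICMP time-exceeded")
    ip_header[8] -= 1
    ip_header[10:12] = b"\x00\x00"            # zero the checksum field (bytes 10-11)
    ip_header[10:12] = struct.pack("!H", ipv4_header_checksum(bytes(ip_header)))
    return ip_header
```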
But there are tricks. The data planes of both the Cisco 6500 and the Extreme i-series can't forward certain types of packets, which means those packets are punted up to the control plane. Run show mls rate-limit to see brief, incomprehensible names for them. There are lots of gotchas
and corner cases. For example, consider the following sentence:
``Layer 2 rate limiters do not work in the truncated fabric-switching
mode: that is, they do not work in a system that has a mix of
fabric-enabled and traditional, non-fabric-enabled line cards or if
the system has classic line cards and redundant Cisco Catalyst 6500
Series Supervisor Engine 720s.'' What does the sentence mean? I
could look up the different switching modes of the 6500 and give you
some more incomprehensible sentences to explain the first one, but the
short answer is, we netadmins will probably never fully understand
that sentence, and further understanding isn't helpful without
experience mitigating DDoS's. Anyway, here are some comprehensible ``special'' traffic types that both Cisco and Extreme switch on the control plane instead of the data plane.
Packets addressed to the router itself hit what Cisco calls the ``receive'' adjacency. I guess this is obvious. But, for example, Cisco has interest (and some progress) in implementing BFD on the data plane. BFD is a very simple UDP-based HELLO protocol meant to bolt onto routing protocols like OSPF and IS-IS, and its primary design motivation was to replace the HELLO protocols built into the control-plane routing protocols with something simple enough to implement on the data plane, so HELLO timers could be set extremely short, like 50ms or less, instead of the current practice, tens of seconds. Once BFD is implemented on a data plane somewhere, that'll be a case of packets aimed at the router itself which aren't sent to the control plane.
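This isn't real BFD (the actual protocol has its own packet format, a three-way handshake, and a state machine), but the detection-timer idea it exists for fits in a few lines of Python; the 50ms interval and multiplier of three are just illustrative.

```python
import time

DETECT_TIME = 0.050 * 3      # hello interval * multiplier: declare the neighbor down
                             # after three missed hellos (values are illustrative)
_last_heard = time.monotonic()

def on_hello_received() -> None:
    """Call whenever a hello/BFD-style UDP packet arrives from the neighbor."""
    global _last_heard
    _last_heard = time.monotonic()

def neighbor_is_up() -> bool:
    """The routing protocol consults this instead of running tens-of-seconds hello timers."""
    return (time.monotonic() - _last_heard) < DETECT_TIME
```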
The value the L3 forwarding table returns is an output interface and an optional next-hop IP address. If the table includes the optional next-hop value, the right way to forward the packet is: look up the next-hop with ARP, set the destination MAC to the one given in the ARP reply, and then dump the frame onto the wire indicated by the output-interface field. If the next-hop is missing, instead look up the destination IP itself with ARP, and use that ARP reply to overwrite the destination MAC address.
So far we've described two tables:
forwarding table: {destination IP, prefix length} -> {output interface, next-hop IP}
ARP cache: {wire IP, output interface} -> {destination MAC address}
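Gluing the two tables together, the forwarding decision from the last couple of paragraphs looks something like this Python toy, with made-up entries and a naive linear longest-match (exactly the per-packet work real data planes go to great lengths to avoid).

```python
import ipaddress

# Made-up entries in the two tables described above.
fib = {("203.0.113.0", 24): ("eth1", "192.0.2.1"),       # route with a next-hop
       ("198.51.100.0", 24): ("eth2", None)}             # directly connected: no next-hop
arp_cache = {("192.0.2.1", "eth1"): "00:11:22:33:44:55",
             ("198.51.100.7", "eth2"): "66:77:88:99:aa:bb"}

def lookup_fib(dst_ip: str):
    """Naive longest-match over the first table."""
    best = None
    for (prefix, plen), value in fib.items():
        if ipaddress.ip_address(dst_ip) in ipaddress.ip_network(f"{prefix}/{plen}"):
            if best is None or plen > best[0]:
                best = (plen, value)
    return best[1] if best else (None, None)

def forward(dst_ip: str):
    out_if, next_hop = lookup_fib(dst_ip)
    if out_if is None:
        return None                                      # no route: drop or ICMP unreachable
    wire_ip = next_hop if next_hop is not None else dst_ip
    dst_mac = arp_cache[(wire_ip, out_if)]               # second table: wire IP -> MAC
    return out_if, dst_mac                               # rewrite the frame and transmit

# forward("198.51.100.7") -> ("eth2", "66:77:88:99:aa:bb")
```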
show ip cef only shows the first table, but in practice, I suspect Cisco combines the two tables. Next-hops that don't have destination MAC's from ARP yet are marked with the CEF glean adjacency and sent to the control plane for buffering (another nasty DDoS vector).
These two tables are enough to implement L3 switching, but not what we
call routing on this page (OSPF, IS-IS, BGP). Unlike the way we
usually think of Unix routers that I've just described, real routers
contain more than one copy of a routing table. The RIB is a table of
L3 destinations and next-hops kept inside the control plane. The RIB
can contain more than one tuple for a single destination, and when
this happens some of the routes to a given destination may be
inactive, because they don't have the best metric. On Cisco, show ip route pretty-prints the RIB, and punctuation
characters mark the active and inactive routes. The FIB, copied from
the RIB, contains only active routes, and contains no metrics.
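The relationship between the two is easy to sketch; here's a Python toy where I've collapsed administrative distance and metric into one number for brevity (the prefixes, protocols, and numbers are made up).

```python
# RIB: every route the routing protocols know, metrics and all.
rib = {
    "203.0.113.0/24": [("OSPF", 20, "192.0.2.1"), ("RIP", 120, "192.0.2.9")],
    "198.51.100.0/24": [("OSPF", 20, "192.0.2.1")],
}

def build_fib(rib: dict) -> dict:
    """Copy only the active (best) routes; the FIB keeps no metrics at all."""
    fib = {}
    for prefix, candidates in rib.items():
        best = min(metric for _proto, metric, _nh in candidates)
        fib[prefix] = [nh for _proto, metric, nh in candidates if metric == best]
    return fib

# build_fib(rib) -> {"203.0.113.0/24": ["192.0.2.1"], "198.51.100.0/24": ["192.0.2.1"]}
```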
The FIB might still contain multiple routes to a single destination, if the data plane is intended to load-balance traffic across two interfaces or next-hops. The most accepted way to do load-balancing is to key the FIB lookup off a hash of the TCP and UDP port numbers---this guarantees a given TCP circuit will follow the same path, unless the control plane's OSPF/IS-IS/BGP routing protocols decide to redirect it, which won't happen often. This ensures the packets belonging to that TCP circuit arrive in order (which isn't important, but some people think it is, and other people design very stupid not-Internet-ready UDP protocols where in-order delivery actually is important). The RADIX multipath scheme can also reduce flows' RTT variance (``jitter''), which is nice for TCP and critical for VoIP.
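A minimal sketch of that hashing trick, assuming the FIB entry carries a list of equal next-hops as in the toy above:

```python
import zlib

def pick_next_hop(next_hops: list, src_ip: str, dst_ip: str,
                  proto: int, sport: int, dport: int) -> str:
    """Every packet of a given TCP circuit hashes to the same next-hop,
    so the circuit follows one path and its packets arrive in order."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

# pick_next_hop(["192.0.2.1", "192.0.2.9"], "10.1.1.1", "203.0.113.80", 6, 51515, 80)
```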
On Cisco, show {ip,ipv6} cef is one way to pretty-print something like the FIB. The FIB more or less belongs in the data plane, but as we'll see, this is arguable.
The two tables I described above aren't enough to implement what people now expect of a router (or even an L3 switch). Policy routing demands putting more values into the lookup key of the FIB, such as: source IP address, IP TOS, packet size, maybe TCP or UDP port numbers. Access lists mean sometimes the output interface will be ``discard this packet.'' Unless the access list or policy routing rules can be translated (``compiled'' or ``optimized'') somehow, we may need to add some kind of sequence number to the lookup key, because policy routing and access lists both have a first-match character, which means modifying the longest-match lookup rule. We already maybe need strange kinds of specially-constructed silicon memory to store the FIB in a way that makes longest-match lookups O(1). Achieving O(1) lookups is even harder with all this policy routing and access list stuff. In general it's not acceptable to have rules saying access lists can't be ``too long,'' so O(1) lookup is a requirement.
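Here's why the first-match character is awkward, as a Python toy: in software you just walk the rules in sequence order before falling back to the longest-match lookup, which is exactly the kind of per-packet linear work the specially-constructed FIB memory exists to avoid.

```python
# Toy policy/ACL table: (sequence number, predicate, action), first match wins.
# An action is either an (interface, next-hop) pair or the string "discard".
policy = [
    (10, lambda p: p["proto"] == 6 and p["dport"] == 22, "discard"),           # access list: drop ssh
    (20, lambda p: p["src_ip"].startswith("10.1."), ("eth2", "192.0.2.9")),    # policy-route one source
]

def classify(packet: dict, fib_lookup):
    for _seq, predicate, action in sorted(policy, key=lambda rule: rule[0]):
        if predicate(packet):                 # first-match, by sequence number
            return action
    return fib_lookup(packet["dst_ip"])       # no rule hit: ordinary longest-match forwarding
```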
The Riverstone 15000 Edge Router was one of the first L3 switches. It doesn't have the word ``switch'' in its name, but with the benefit of hindsight we can see that's what it was. Riverstone's idea was flow forwarding---a sneaky way to get these policy routing and access list features while using much less fancy FIB-storage memory chips, less fancy even than chips capable of longest-match.
A ``flow'' is now an accepted, first-class entity in the L3 world since the ``flow label'' field got put into IPv6. The IPv6 flow label is at layer 3, not layer 4 or 5 or whatever silly OSI-layer-model-memorizers will tell you. It's missing in IPv4, but in that case it can be approximated by the header hash of protocol and port numbers I spoke of earlier, the one that's useful for multipath (multiple FIB entries for a single destination) and for implementing WFQ output queues that punish poorly-behaved flows that aren't using good congestion-backoff algorithms.
Well, whatever... I don't mean to ramble. The point is, the flow idea is already used for more than just flow switching. A flow is supposed to approximate a single conversation, like an established TCP circuit, or a UDP streaming source sending a string of packets with the same UDP source and destination ports and no more than a few seconds between packets. Riverstone's idea was to optimize their data plane for long flows. They replace traditional FIB lookups with a flow lookup in a table that has one entry per flow. The first packet of each flow gets switched by the control plane. The traditional FIB then stays in the control plane, or it can be a nonexistent thing derived from the RIB on demand. The FIB in the data plane is replaced with a route cache or a flow cache. I think Extreme calls it an IPFDB.
This has some cheap, convenient consequences. The data plane doesn't have to implement access lists, because only permitted combinations of source, destination, protocol, and port number will be entered into the route cache. The data plane doesn't need to implement load balancing, either, because the control plane can implement best-common-practice RADIX load balancing by giving different next-hops to the tens, hundreds, thousands of route cache children that may spawn from a pair of RIB entries calling for a load-balanced route.
As far as I can tell, the Extreme IPFDB route cache always contains fully-specified entries with:
{input interface, source IP, destination IP, L4 protocol (which can be only TCP or UDP), L4 source port, L4 destination port} -> {output interface, L2 destination MAC, boolean: rewrite or don't-rewrite the source MAC}
(With flow forwarding in general, sometimes input interface is part of the table, and other times there is a separate table for each input interface.)
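A sketch of the flow-forwarding split, under my reading of it (the installed entry and MAC are made up): the data plane only ever does an exact-match lookup on the 6-tuple, and every miss is the control plane's problem.

```python
# Flow cache keyed the way the IPFDB entries above are described.
flow_cache = {}   # 6-tuple -> (output interface, destination MAC, rewrite-source-MAC flag)

def transmit(out_if: str, dst_mac: str, rewrite_src: bool, pkt: dict) -> None:
    print(f"fast path: {pkt['dst_ip']} out {out_if} to {dst_mac}")

def control_plane_first_packet(pkt: dict, key: tuple) -> None:
    """Slow path: check access lists, do the longest-match / policy lookup, pick a
    load-balanced next-hop, then install a fully-specified entry (or refuse to)."""
    flow_cache[key] = ("eth1", "00:11:22:33:44:55", True)   # hypothetical result

def data_plane(pkt: dict) -> None:
    key = (pkt["in_if"], pkt["src_ip"], pkt["dst_ip"],
           pkt["proto"], pkt["sport"], pkt["dport"])
    entry = flow_cache.get(key)               # exact match only: no longest-match silicon needed
    if entry is None:
        control_plane_first_packet(pkt, key)  # first packet of the flow gets punted
        return
    out_if, dst_mac, rewrite_src = entry
    transmit(out_if, dst_mac, rewrite_src, pkt)
```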
Extreme's fully-populated-route-cache requirement (which might be a fantasy of mine, not an actual requirement, but...) means they forward anything that's not TCP or UDP through the control plane.
That includes ICMP pings passing through the switch, GRE tunnels,
IPsec---any IP protocol besides TCP or UDP. I find the idea of
putting something like that on the public Internet a little chilling!
I don't actually have a fast enough Internet connection for it to
matter. I also don't enable ipforwarding on any Extreme VLAN that's in front of my firewall. The Extreme stuff works well for
me, but this flow forwarding idea has some drawbacks.
There's another problem. Flow-forwarding switches aren't good enough for certain traffic mixes. Suppose the traffic mix is what you'd find in front of a load balancer (which must be a lot more expensive than your flow-based L3 switch!): a few gigabits' worth of web users reloading slashdot.org or search.yahoo.com. Here is the first problem: the flows are short. They're probably only 4 - 10 packets per connection, and in this case one connection is two flows. The route cache on the data plane doesn't have infinite size---it will fill up, or ``thrash.'' And the first packets of these flows will overwhelm the control plane.
You might say, ``well what if we don't store so much in the route cache. What
if we don't need to implement policy routing and access lists? We
just want to do BGP on the switch, and any other fancy stuff we need
to do will be the load balancer's job.'' Riverstone let you control how many fields go into the route cache lookup key. ip set forwarding-mode destination-based implements this what-if, but the Riverstone route cache still doesn't contain longest-match destinations.
It contains IPv4 /32's, and the control plane still has to put them
there. If your pool of webservers is sending back TCP SYNACK's to web
browsers all over the Internet, the control plane will still be
overwhelmed by first-packets, and the data plane route cache will still
overflow with cached entries. It's not solvable this way.
The answer is to pay for a device that has a fancier longest-match FIB in its data plane---Cisco 6500 with a modern supervisor, instead of Extreme Alpine/i-series or Riverstone 15000. Extreme even sells this bizarre longest-match module for the BlackDiamond, but it still doesn't seem able to switch the backplane's full bandwidth in longest-match mode.
Even a longest-match FIB will still have a limited size, and this matters. It's actually worse in a certain way. For example, for the Cisco 6500, to run a default-free full BGP table, you have to buy the PFC3BXL instead of the PFC3B. If you are not running a full BGP table, you don't have to buy the XL version of the PFC, though they still tried to sell us one. Now, suppose the traffic mix is made of longer flows, big data transfers, instead of reloads of tiny web pages, so Extreme's flow-switching is fine. But suppose this time the L3 switch needs to run a full BGP table from two or more different ISP's. In this case, the control plane needs to have enough DRAM to keep a full BGP table in the RIB, but the data plane's route cache sizing only cares how quickly flows are brought up and torn down, not a bit about how big the RIB is. Extreme flow-switching has another cost advantage here. On Cisco, not only does the MSFC still need enough DRAM for a full-feed RIB, but one also needs the XL version of the PFC to store the >200k rows of longest-match FIB in the data plane.
Some routers will do IPsec on the data plane, but only the small ones. (discussed below)
In general, the data planes of routers are slower and more flexible than those of L3 switches. However, I think this is less true as you pay more money for the device---as we've discussed already, the most expensive L3 switches have longest-match FIB's instead of flow-forwarding (Extreme) or an optional exact-match mode (Riverstone). Likewise, the most expensive routers can switch small packets at line rate, but have lots of quirky limitations about which features one can use with which line cards.
TCP is going into the load balancer. Clusters are no longer exotic because of VM farms and big web sites, and are no longer impractical now that buyers have given up asking for cluster-shared memory, but they need certain things. Some need a SAN, others need to handle the largest number of web hits per dollar and per watt, and others need to do the largest amount of SSL negotiations per dollar and watt. The big manufacturers of ``blade'' systems will respond to this by partnering with some second-rate SAN or networking company to allow Fabric Management Modules in their blade chassis and offer some non-Ethernet fabric like FC or IB among their blades and between cabinets at a deep discount.
There will be read-only filesystems for sale from startups, designed to work with these single-vendor fabric-enabled blade clusters. These filesystems will be implemented entirely as a userland shared library that sits between Apache and iSCSI, like filesystems in VM/CMS, but they'll use a chunk of SYSV shared memory as buffer cache. The filesystems will be buggy and expensive and make fantastic privilege escalation vectors, except that they'll be so complicated and costly that hackers will never figure out how they work.
The blade fabric system will double your web throughput, but only if you buy from them the vendor-locked-in, extremely expensive, and annoyingly quirky head end module. It replaces your load balancer with something you like much, much less, and it costs only slightly less than twice as many CPU's would and uses only slightly less power, but it saves you $50,000 and $4,000/mo, so you don't really have a choice, do you? It will convert <fancy non-Ethernet SCTP fabric> to Ethernet and TCP.
Yes, it's TCP on the dataplane, the true ``L4 switch'' for anyone crazy enough to want it. It'll take care of all the TCP windows and timers and stuff in hardware and squeeze twice the number of slowass bittorrent-overloaded lossy DSL user web hits out of the same crappy Intel CPU farm. Pie charts, histograms, success stories from gullible medium-sized banks will show this. Horrible bugs will be discovered in its TCP implementation. Hackers will be able to take down whole web sites until new revisions of physical line cards are produced.
There will be TCP specification noncompliance, too. Windows, Mac OS X, and Linux (and three years later, Solaris) will all release software updates to work around misimplementations of TCP in a certain popular load balancer's data plane that cause poor performance over congested or high-speed links. BSD will not implement the workaround, and will always be oddly slow at accessing myspace, but we'll be stuck with these workarounds forever. The buggy cards will disappear after four years. The development of TCP will be retarded by about seven years by the imperative of keeping the workarounds forever, and by extensive bickering about them.
Yes, we will get IPv6, but we will be stuck with 1500-byte packets on the Internet forever. Any packet larger or smaller than 1500 bytes will exercise strange obscure bugs all over the place so no one will use them on the Internet, but every AS will secretly have, use, and depend on 9kByte or larger packets internally. Netadmins will have dick-size contests about whose AS can pass the largest packet end-to-end, but they will never sell any link to any customer for less than $5,000/mo that can transfer packets larger than 1.5kByte. The largest packet will be 64kBytes and will be quietly implemented throughout Sweden.
A quiet well-considered voice will beg for packets with at least 8kByte of payload and some L4 mechanism for splitting streams into receiver-directed aligned mmu-page-sized chunks. Someone will scream that it doesn't work with multicast, and distract everyone from finishing the proposal for eight months. RFC's for doing this will finally be published, discussed, perfected, and ignored. However the TCP-to-SCTP-converting load balancer mentioned earlier will learn to do this job, and through awkward proprietary tricks convert 1496-byte-MTU TCP streams into neatly page-aligned 8kByte chunkified SCTP streams that fancy InfiniBand cards can deliver to and from Apache without a single syscall. There will be a knob where you can choose if you want smaller packets padded to 8kByte or not, and this knob will be stupid and useless and everyone will unknowingly have it set wrongly.