Switched Ethernet considered harmful

Switched Ethernets have a failure mode called a ``storm.'' Many things can cause a storm. Cisco's manuals warn against several variants of network ``misconfigurations'' that can cause storms, which makes it sound like the storm is your ``fault'' for misconfiguring the network; but by the time I'd read Cisco's third warning against possible storms, I realized that ``misconfiguration'' just means ``a configuration that could cause a storm.''

It's nice when a large network is minimally vulnerable to those denial-of-service attacks that ordinary users can perpetrate, with access only to leaf network ports. In a big company, some ordinary users might be apathetic towards the company's interests, or perhaps even Enemy Infiltrators. What's more, some ordinary users will be stupid, ill-informed, and perhaps even desperate. They can't be counted on not to ``misconfigure'' the network.

One ``misconfiguration'' that can sometimes cause a storm is connecting two RJ45 wall jacks with a crossover cable. Ordinary non-switched hubs will detect this condition and ``partition'' the two ports involved, meaning they turn them off for a time interval a couple of orders of magnitude longer than the interval the hub needs to detect the ``misconfiguration.'' It would be good if switches could survive this mistake without causing a storm. Switches accomplish this with ``spanning-tree protocol.'' STP is like an inverse routing protocol, creating a map of how not to get there from here. Its purpose is to learn the interconnections among switches, and then turn off (``STP-block'') ports until these interconnections form a tree rather than a general graph. Unfortunately, a switched port has many configurable properties which affect STP, which is why there are so many warnings about storms in the Cisco manuals. Detecting looped-back ports is precarious for a switch, while for a hub it is simple. Because hubs repeat packets on all ports at once rather than store-and-forward, a hub can find the misbehaving ports through timing: looped ports don't obey Ethernet's collision detection rules.
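
To make the ``tree rather than a graph'' idea concrete, here is a minimal sketch of what STP is trying to compute. It is not the real 802.1D protocol: there are no bridge priorities, path costs, timers, or BPDU exchange, and the three-switch topology is invented. It only shows the end result STP aims for: keep a spanning tree of the inter-switch links and block the rest.

    # Minimal sketch of the tree-building idea behind STP (not the real
    # 802.1D protocol: no bridge IDs, path costs, timers, or BPDUs).
    # Given switch-to-switch links, keep a spanning tree rooted at one
    # switch and "block" every redundant inter-switch link.
    from collections import deque

    links = [                      # hypothetical topology: three switches in a loop
        ("sw1", "sw2"),
        ("sw2", "sw3"),
        ("sw3", "sw1"),            # this link closes the loop
    ]

    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    root = min(adjacency)          # real STP elects the lowest bridge ID as root
    visited = {root}
    forwarding, blocked = set(), set()

    queue = deque([root])
    while queue:                   # breadth-first search builds the tree
        sw = queue.popleft()
        for neighbor in adjacency[sw]:
            edge = frozenset((sw, neighbor))
            if neighbor not in visited:
                visited.add(neighbor)
                forwarding.add(edge)
                queue.append(neighbor)
            elif edge not in forwarding:
                blocked.add(edge)  # redundant link: STP-block it

    print("forwarding:", [tuple(e) for e in forwarding])
    print("blocked:   ", [tuple(e) for e in blocked])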

Unfortunately, the real situation is not quite that predictable. There was a ``security issue'' with Cisco's most popular switch and beta versions of Windows XP, where plugging a Windows XP machine into a switched Cisco network caused a storm which engulfed the entire switched network, including remote offices connected by switches rather than routers. One ordinary user could inadvertently bring down a company's entire network just by plugging a normal laptop into it, which is what happened to Xerox, three times. Of course it's a ``bug'' and Cisco came out with a ``fix,'' but other network architectures can tolerate bugs without crashing the entire WAN.

What is a storm? It's when one packet injected into the switched network can be replicated and retransmitted without bound, because of a forwarding loop among the switches. Unlike hubs, switches receive packets completely, then retransmit them, which means a forwarding loop can immortalize a packet.

Routed networks don't have storms because routers decrement the ``time-to-live'' (TTL) header field every time they forward a packet. Routing loops can still exist, but they don't cause storms: a looping packet's TTL eventually reaches zero and the packet is discarded.
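
A toy model of the difference, following a single packet around a two-element loop. It is illustrative only: a real switch also floods unknown destinations, which multiplies the packet instead of merely forwarding the same copy forever, so the real failure is worse than this sketch suggests.

    # Toy model of why a loop is fatal for a switch but survivable for a
    # router: either decrement a TTL per hop (what a router does) or don't
    # (what a layer-2 switch does).
    def forward_around_loop(decrement_ttl, initial_ttl=64, max_hops=1000):
        ttl = initial_ttl
        for hop in range(1, max_hops + 1):
            if decrement_ttl:
                ttl -= 1
                if ttl == 0:
                    return f"dropped after {hop} hops"
        return f"still circulating after {max_hops} hops (and counting)"

    print("router loop:", forward_around_loop(decrement_ttl=True))
    print("switch loop:", forward_around_loop(decrement_ttl=False))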

There are other problems.

Switches make 'traceroute' useless. Traceroute is a safe, bandwidth-inexpensive tool for diagnosing network problems. It's so safe that ordinary users can run it without doing any harm. Traceroute works by sending a series of packets with increasing TTLs, starting at one.

Traceroute [uses] the IP protocol `time to live' field and attempts to elicit an ICMP TIME_EXCEEDED response from each gateway along the path to some host. (traceroute(8))

No TTL, no traceroute.
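
Here is a minimal sketch of the trick traceroute(8) describes, in Python. The helper name is made up, the probe port 33434 is the traditional traceroute base port, and the sketch needs root for the raw ICMP socket; it also omits the per-hop timing, retries, and reverse DNS that a real traceroute does.

    # Minimal sketch of the traceroute idea: send UDP probes with increasing
    # TTLs and read back the ICMP TIME_EXCEEDED (or, from the destination,
    # PORT_UNREACHABLE) that each hop sends in response.  Needs root.
    import socket

    def tiny_traceroute(dest, max_hops=30, port=33434):
        dest_addr = socket.gethostbyname(dest)
        for ttl in range(1, max_hops + 1):
            recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
            send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
            recv.settimeout(2.0)
            send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            send.sendto(b"", (dest_addr, port))
            try:
                _, (hop_addr, _) = recv.recvfrom(512)   # who complained?
            except socket.timeout:
                hop_addr = "*"
            finally:
                send.close()
                recv.close()
            print(ttl, hop_addr)
            if hop_addr == dest_addr:
                break

    tiny_traceroute("example.com")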

What's the difference between a switch and a router? Routers have more expensive software in them. Cisco has built a precarious network architecture around a marketing strategy meant to tap into the demand curve at multiple points, by creating an entire set of broken standards around crippled, lower-cost software licenses. In Cisco-world, a router can do X, Y, and Z, while a switch can do only X and Y. One connects a small number of routers to a large number of switches when one's Z-needs are modest but non-zero. This situation ought to infuriate customers, so Cisco stresses the positive aspects of crippled switches: their brokenness makes them ``simpler,'' and thus easier to administer (except for the lack of traceroute and the proliferation of obscure cases which cause storms).

I'm actually pretty well-convinced that small switched networks are easier to set up than routed networks, but I think that, with large networks, they are a liability. The appeal of switches in large networks is no longer simplicity, because they actually add complexity---rather, the true appeal of large switched networks is the feeling of power a sysadmin gets from having absolute control over each individual port in each plebe's office. I believe this delicious high is more than obliterated by the feeling of impotence that comes from losing control of an entire network to an obscure boundary case that causes a storm, and being forced to flip power switches, pull cables, and use out-of-band management to regain control---only to remain at the mercy of the next Windows XP user or obscure boundary case to poke its head around the corner.

Other problems with switches mostly boil down to lack-of-Z. The big fun in packet-switched networks right now is all about queueing strategies and QoS. This fun means there is a whole mess of Z features associated with anything that has an output queue---which includes anything that receives whole packets, then retransmits them. Routers, switches, and individual hosts all have output queues; hubs do not.

Lack-of-Z means switches can't implement clever queueing strategies. They can't use the ``explicit congestion notification'' extension to TCP, because signalling congestion with ECN means marking the IP header, which a layer-2 switch isn't allowed to touch. They can't implement any queueing strategy with classes like CBQ or H-FSC, because they can't see enough header information to classify packets. They can't set up stateful real-time channels with RSVP, because there is no way for individual hosts using RSVP to address the switch. While switches may claim to implement ``QoS,'' they can only assign packets to service classes based on the physical port through which the packet enters the switch, while the more useful router QoS can classify based on protocol port numbers or RSVP reservations.
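
To make the classification gap concrete, here is a toy sketch. The class names and the RTP-ish UDP port range are illustrative assumptions, not any vendor's actual scheme; the TCP port 2049 is the well-known NFS port.

    # Sketch of the classification gap: port-based "QoS" (all a dumb switch
    # can do) versus header-based classification (what a router can do).
    def switch_classify(packet):
        # The switch only knows which physical port the frame arrived on.
        return {1: "voice", 2: "data"}.get(packet["ingress_port"], "default")

    def router_classify(packet):
        # The router can look at L3/L4 headers: protocol, ports, DSCP, etc.
        if packet["ip_proto"] == "udp" and packet["dst_port"] in range(16384, 32768):
            return "voice"          # a typical RTP port range (assumption)
        if packet["ip_proto"] == "tcp" and packet["dst_port"] == 2049:
            return "nfs"
        return "default"

    pkt = {"ingress_port": 2, "ip_proto": "udp", "dst_port": 20000}
    print("switch says:", switch_classify(pkt))   # -> data (wrong class)
    print("router says:", router_classify(pkt))   # -> voice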

One might argue that switches are good for ``normal'' networks, which have only one class of packet and no QoS, and in which links never go down so we don't need traceroute. Okay. Consider the following two architectures, identical except that one uses a hub and the other a switch.

		 +------------+
 		 |   Hub      |
       	       	 +------------+	     (100Mb/s)
       (100Mb/s)   /  /	\ \ \_______________________
      ____________/  / 	 \ \___________	       	    \
     /	       _____/ 	  \_           \ (100Mb/s)  |
     |	      /(100Mb/s)    \(100Mb/s) |       	    |
 +-------+  +---------+  +---------+  +---------+  +---------+
 |Server |  +---------+  +---------+  +---------+  +---------+
 +-------+     	       	       NFS clients

                          -or-

		 +------------+
 		 |  Switch    |
       	       	 +------------+	     (100Mb/s)
       (100Mb/s)   /  /	\ \ \_______________________
      ____________/  / 	 \ \___________	       	    \
     /	       _____/ 	  \_           \ (100Mb/s)  |
     |	      /(100Mb/s)    \(100Mb/s) |       	    |
 +-------+  +---------+  +---------+  +---------+  +---------+
 |Server |  +---------+  +---------+  +---------+  +---------+
 +-------+     	       	       NFS clients

Almost all the traffic is between the NFS server and one of the clients. Under load, the switched network will accumulate an output queue on the switch's port to the NFS server. The switch must limit the length of this queue by dropping packets when the queue is full. Clients must learn to limit their collective output bandwidth to less than 100Mb/s total by watching their packets get dropped. NFS-over-UDP is very bad at responding to congestion like this. NFS-over-TCP is better, but whether using NFS-over-UDP or NFS-over-TCP, packets will get dropped, and the user's disk access will be delayed by a retransmission interval. The advantage to TCP's congestion control is supposedly that packets will get dropped less often, but TCP does this by throttling its output every time a packet gets lost, and the throttling itself also harms the filesystem's latency.

The unswitched network will not lose packets, because each NFS client knows when another NFS client is transmitting. The built-in carrier sense feature of Ethernet will make packets wait on the clients until the server has free bandwidth. The output queue which was dropping packets is simply eliminated.
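
Here is a rough simulation of the switched case above: four clients burst toward the server through the switch's single server-facing port, which drains at a fixed rate behind a small tail-drop FIFO. Every number (queue depth, drain rate, burst size) is made up for illustration; the point is only that offered load above the drain rate turns into silent packet loss.

    # Rough simulation of the switched topology: tail-drop FIFO on the
    # switch's 100Mb/s server port, four bursty NFS clients.
    import random

    random.seed(1)
    QUEUE_LIMIT = 64        # packets the switch will buffer for the server port
    DRAIN_PER_TICK = 8      # packets the server link can empty per tick
    CLIENTS = 4
    MAX_BURST = 5           # packets each client may offer per tick (avg 2.5)

    queue = dropped = delivered = 0
    for tick in range(1000):
        for _ in range(CLIENTS):
            offered = random.randint(0, MAX_BURST)
            accepted = min(offered, QUEUE_LIMIT - queue)
            queue += accepted
            dropped += offered - accepted    # tail drop: the excess just vanishes
        drained = min(queue, DRAIN_PER_TICK)
        queue -= drained
        delivered += drained

    total = delivered + dropped + queue
    print(f"delivered {delivered}, dropped {dropped} ({100 * dropped / total:.1f}% loss)")
    # On the hub, carrier sense would have held those "dropped" packets on the
    # clients until the wire was free, instead of discarding them in the switch.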

To compensate for switched Ethernet's limitations, Sun is changing the default transport protocol for NFS from UDP to TCP. Without ECN, this is IMO a very dubious compromise, because it is essentially conceding that your filesystem will frequently wedge for a TCP RTO (about 1 second) whenever the switch commands TCP to adjust its congestion window by dropping a packet.

This effect potentially gets much worse with a storage network like iSCSI, because although iSCSI uses TCP, there can be tens or hundreds of disks trying to send data concurrently to a single initiator (which is, for example, reading a wide RAID stripe). That makes the ``right size'' for the output buffer on the switch the same size as the RAID stripe, which is unreasonably large for a network element's output buffer (~1MByte). Even if it were analogous to the NFS case in the number of dropped packets, here we are freezing access to the entire filesystem for everyone, while in the NFS case we freeze one of what is presumably many clients. And it is much worse packet-loss-wise, because the large number of independent TCP streams all sending in lock-step interferes with the ability of the TCP stacks inside the iSCSI disks to do congestion control, just as bittorrent does to home users' DSL upstreams, only really much worse because the disks will be better synchronized than bittorrent peers. Here, we are not talking about just freezing for an RTO as one little ``adjustment'' packet gets lost. We're talking about losing some large fraction of the RAID stripe on every disk access. When this effect happens, it could make every stripe access take several RTOs to complete, changing a 10ms operation into a >3 second operation. It's basically a show-stopper to the implementation.
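
The back-of-the-envelope arithmetic, with every number an illustrative assumption (real arrays, switches, and TCP stacks vary widely):

    # Why a wide stripe overwhelms one switch port.
    disks          = 64            # disks answering one wide stripe read
    chunk_bytes    = 16 * 1024     # per-disk chunk of that stripe
    stripe_bytes   = disks * chunk_bytes
    port_buffer    = 128 * 1024    # switch buffer behind the initiator's port
    rto_seconds    = 1.0           # roughly the minimum TCP RTO of that era
    normal_read_s  = 0.010         # a healthy 10ms stripe read

    print(f"stripe converging on one port: {stripe_bytes // 1024} KB "
          f"vs ~{port_buffer // 1024} KB of buffer")
    print(f"~{(stripe_bytes - port_buffer) // 1024} KB has nowhere to go "
          f"if every disk answers at once")
    print(f"one RTO stall: a {normal_read_s:.3f}s read becomes "
          f"~{normal_read_s + rto_seconds:.2f}s; a few RTOs and you're past 3s")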

SCSI, I think, deals with this problem exactly how Ethernet does, with some kind of CSMA/CD. Since the maximum physical bus length (~15 feet, 16 bits wide), measured in bits, is 300 to 1000 times shorter for SCSI than for Ethernet, I suspect carrier-sense can be a lot more efficient. The collision window is something like 1 bit. There may also be some kind of polling (disconnect/reconnect) involved, which is sort of analogous to token passing, but I don't really understand SCSI well.

I'd like to understand how Fibre Channel ``fabrics'' and SAS switches and ``fan-outs'' deal with this effect, since they can have hundreds of disks, and they have hundred-meter links only 1 - 4 bits wide. I think there is another type of network switch out there, one which has zero shared buffer memory and solves the wire-length problem while still remaining very hub-like in that it never drops packets. And I suspect this new type of network switch is already in use all over the place in Infiniband, SAS, FC, and USB. But for me it is like reading tea leaves right now: I don't know anything about this hypothetical new kind of switch beyond that vague impression.

I realize there are supposedly other applications where switches make a lot more sense than these bad cases I'm using as examples. The problem is that my example application-and-topology of NFS fits most actual networks quite well. For example, it also fits the case where the switch's fifth port is not an NFS server, but rather a 100Mbit/s ``uplink'' to a router or to another switch.

And the even more pathological example of storage is already very important at the network edge.

Just because I use the word ``latency'' doesn't mean I'm arguing that hubs solve a QoS problem where switches fail. My argument is simply that switches bring no help and some harm to typical non-real-time network applications. Don't confuse the fact that this particular harm tends to be ``latency on the order of seconds'' with a formal discussion of QoS.

My point is much simpler, and doesn't raise the tangled mess of QoS issues at all. For the old-fashioned network applications like web browsing and NFS mounting that currently dominate LAN wiring, a simple broadcast network will share bandwidth better, and perform better, than a switch with tail-drop FIFO output queues. In fact, there is an embarrassing lack of applications where extending FIFO tail-drop point-to-point links out to leaf nodes makes sense in terms of performance (never mind cost), yet this is currently the most popular network architecture. And ``switched Ethernet'' can never do better without turning into a bloated cargo-cult monstrosity.

What's the way out of this mess?

We could go back to hubs and routers. I like this idea, for new networks in the short-term, because if done right it ends up equating to simply not spending money on useless things. If you want VoIP without switches, just buy a separate hub for phones and keep the traffic physically separate until it hits the outermost router.

We could invent faster ports, like ``Gigabit Ethernet,'' and use them for switch uplinks and NFS servers. The problem with this idea is that, based on our experience with 100BaseTX, there is only a short gap in time between (a) when the next-generation, ten-times-faster technology becomes available and appropriately priced for NFS servers and uplink ports, and (b) when it's cheap everywhere, and using the slower standard doesn't really save money.

At Evolving in 1997, we were trying to do this 100Mbit/s-server, 10Mbit/s-client thing, and the big problem was that the 100Mbit/s interfaces simply didn't work. HP shipped interfaces with buggy drivers that crashed the whole machine, and Sun shipped first-generation interfaces that dropped packets over 30Mbit/s. By the time the vendors got these problems worked out, they were putting 100Mbit/s interfaces into everything for free.

The other problem is, this doesn't really solve the problem of having superfluous output queues on switches that lack the cool Z-features we need for ECN and QoS. It just moves the output queues onto the switch's slower ports connected to the client machines, which then get overwhelmed with bursts of information from the NFS server or the uplink switch. This leads to the hilarious situation where users experience lower latencies when the network (and thus the uplink port or NFS server) is busy with lots of different clients than when it is quiet and each user is the sole target of the high-speed port's swamping attention.

We could start adding features to switches until they become just as complicated as routers, except flakier. This is what I expect will actually happen, but I wish we could just stick with the perfectly fine routers we already have rather than reinventing the router in this ass-backwards way which, in an effort to make things simpler, ends up being more complicated. So far, we are well on the way toward this strategy. The broken switch protocols are now ISO standards, and ordinary hosts that aren't switches at all are adopting them.

This trend of making more devices that pretend to be switches means that a lot more wall jacks in regular plebe offices will have special switch-options enabled on them, creating STP special cases and storm vulnerabilities. This is, in fact, exactly the trend that led to the disastrous Cisco bug with Windows XP, except that the Cisco bug happened without even enabling any special port options. And the benefit of abusing switch protocols by extending them into leaf hosts? A very limited, half-baked QoS inferior to what routers provide to ordinary hosts that need not pretend to be switches.

According to Cisco's manuals, the Cisco VoIP phones contain virtual 3-port switches, in which one port is hardwired to the phone's guts and the other two ports are exposed on the back. This way, one can install the phone in-line between computer and wall jack. The phone can use proprietary ``Inter-Switch Link'' packets between itself and the switch serving the wall jack. These MAC-layer ISL headers and nonstandard jumbo frames allow the phone to place its own packets in a different QoS class from the office computer's, even though switches normally can't split packets arriving on a single port into two classes. Indeed, the switches are still classifying based on physical port, if you accept Cisco's metaphor of a switch with an unexposed physical port inside the phone. So when Cisco predictably says, ``only misconfigured networks let hosts use protocols reserved for our switches. Do not do this. It can cause storms,'' the correct response is, ``You started it!''

Ultimately, I think the current use of switches is exactly backwards. Right now, we imagine routers as belonging closer to the network's core, while leaf topology is better handled by ``simpler'' switches. Instead, switches belong at the core of high-speed networks, surrounded by routers. There, a switch's simplicity is justified: the external interfaces are almost as fast as the structures inside the switch, so there is no time for complicated computation. Instead of many switches feeding into one router, we have many routers feeding into one switch. The routers can pre-process packets for the switch, marking packets' outer headers for specific output ports on the switch so that the switch contains no routing table. Hopefully, the routers might even cooperate amongst themselves to implement QoS on the switch's output ports with only minimal help from the switch. This is much better than the status quo, because we have some hope of devising clever distributed algorithms to regain the Z features usually lost by using a switch, and because the switch's stupidity and vulnerability are contained within a tightly-controlled environment manipulated and preserved by routers, rather than left naked to unpredictable users.
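
To make the division of labor concrete, here is a toy sketch. The field names, table, and helper function are invented for illustration; the real-world analogue of this idea is roughly what MPLS-style label switching does.

    # Sketch of "routers pre-process for a dumb core switch": the edge router
    # does the expensive classification and routing lookup, then writes the
    # result into a tiny outer tag; the core switch only reads that tag.
    ROUTING_TABLE = {
        "10.1.0.0/16": {"core_port": 3, "qos_class": "bulk"},
        "10.2.0.0/16": {"core_port": 7, "qos_class": "voice"},
    }

    def lookup_prefix(dst):
        # Stand-in for a real longest-prefix match.
        return "10.2.0.0/16" if dst.startswith("10.2.") else "10.1.0.0/16"

    def edge_router_encapsulate(packet):
        route = ROUTING_TABLE[lookup_prefix(packet["dst"])]   # the slow, smart part
        return {"tag": route, "inner": packet}

    def core_switch_forward(tagged_packet):
        # The core switch has no routing table at all: one field read, one queue.
        tag = tagged_packet["tag"]
        return tag["core_port"], tag["qos_class"]

    frame = edge_router_encapsulate({"dst": "10.2.5.9", "payload": b"..."})
    print(core_switch_forward(frame))     # -> (7, 'voice')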

This, too, is already at least being planned. Hitachi defends the necessity of distributed routing and QoS classification, while the following article foolishly proposes to simply make routing really fast, so the complicated algorithms can still be implemented in one place, using some space-age optical computronium material. Riiiight. Only optical computronium has the time-to-market and protocol-agility features that Next-generation Networks will demand. GMU's foolishness or grant-grabby sneakiness is further apparent from their claim that IPv6 will make routing more challenging because the addresses are bigger than IPv4's, while almost everyone else understands that IPv6 will make routing tables smaller, not bigger.

All that said, I think a lot of companies like switches for no more complicated reason than just so they can plug 10Mbit/s or 100Mbit/s devices into any wall jack and have the switch automagically sense the device's speed. <shrug>.

The problem of ``network diameter''

I'm slapping this in here, but really the whole rant needs to be rewritten to acknowledge this fundamental issue. This issue is the real reason hubs, and hub-like things, have no future.

Gigabit Ethernet is at the edge of what you could use a hub for. At 1Gbit/s, the 64-byte collision window is about 90 meters long on the wire, and the substantial delay introduced by repeaters and PHYs brings the practical network diameter I found on the Interweb down to about 20 meters. For 10Gbit/s Ethernet, we're down to about two meters.
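
Here is the arithmetic behind those figures, assuming a signal propagation speed of roughly 0.6c in twisted pair. The usable network diameter is roughly half the frame's physical length (the collision has to make a round trip) and shrinks further with the repeater and PHY delays ignored here.

    # How long a minimum-size frame is, physically, at each line rate.
    PROPAGATION = 0.6 * 3.0e8        # m/s, assumed signal speed in copper
    MIN_FRAME_BITS = 64 * 8          # classic Ethernet minimum frame

    for name, rate_bps in [("100BaseTX", 100e6), ("1000BaseT", 1e9), ("10GBaseT", 10e9)]:
        frame_seconds = MIN_FRAME_BITS / rate_bps
        frame_meters = frame_seconds * PROPAGATION
        print(f"{name:>9}: a 64-byte frame occupies ~{frame_meters:.0f} m of cable")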

802.3z uses 512-byte collision windows (and thus, via carrier extension, a 512-byte minimum occupancy on the wire) to work around this problem and make the network diameter about the same as it was with 100BaseTX, but such a large collision window makes the cable become almost all noise when there is any contention, compared to 100BaseTX. Getting good cable utilization depends on the ``carrier sense'' characteristic---normal data packets need to be fairly long compared to the collision window. The large minimum packet size also has the odd effect that, for a unidirectional TCP flow, the stream of TCP ACKs will take about 1/6 the raw bits of the stream they're acknowledging (each ACK gets padded out to 512 bytes on the wire, and there is roughly one ACK for every two 1500-byte data frames, so 512/3000 is about a sixth), and considering that ACKs will now collide with the data a lot more often, it's really much worse than 1/6th. Clearly, for high-speed moderate-distance networks, we need something that's full-duplex, and that means some kind of switch.

