The way things have panned out, these L3 switches are similar to each other and share many of the same quirks---as a group, they're not as varied as the things that called themselves routers. Also, the evolution of our toolbox---VoIP, iSCSI, cheap 1000BaseT, TCP enhancements, and the practical, simplified DiffServ model for QoS---has doomed the hub-and-CPU-based-router model.
I still have some of the problems I originally complained about, but the new L3 switches behave like routers and address many of the objections that once made me stubbornly buy a bunch of managed 100BaseTX hubs and use a Unix box as an L3 router. I've made a table of the things I used to hate about switches and how my current batch of L3 switches measures up to each objection.
Objection | Resolution |
---|---|
Broadcast storms. I mentioned Xerox's problems with large switched domains. Xerox's problem was, to my view, a tiny bug in an unnecessarily fragile architecture. But there are too many knobs for tweaking STP, and there are unmanaged low-quality switches sitting in individual cubicles, both of which make these problems worse. |
Since then, there are new stories about high-value, expensively-managed networks that collapsed because sysadmins had too much faith in their expensive switches: a hospital network in Boston, and a MAN in New Zealand. It's still happening, and the stories are fantastic.

It's possible, and I believe a best practice, to avoid large STP domains---really, it's best to avoid STP entirely. L3 switches make this possible, but the obvious configuration gives up the VPN features these big organizations became accustomed to getting from their oversized STP domains. In the old, big, fragile L2 networks, VLAN's were cheap things. One could have trunks carrying tens or hundreds of VLAN's without excessive STP traffic, thanks to IEEE MST or Extreme's EMISTP, and one could punch down edge ports onto VLAN's with no regard for geography. Big companies like this because it lets them have a mess of firewalls. MAN's like it, too, because it gives them a convenient way to build none-of-our-business point-to-point Ethernet-handoff links between pairs of customers. This is achievable in an L3 world, but the new tools are not drop-in replacements, and they require special switch features that not everyone has (my ExtremeWare 7.x switches don't have them).

Finally, IPv6 works mostly okay on L2 switches, except for a problem with multicast, and it works fine on most routers. Juniper has had good v6 support for a long time. Cisco's v6 became decent a bit more recently, but they back-ported it even to very old and cheap routers like the 1721: CEF, OSPF, BGP, all working. For L3 switches, it's a different story. The brand new L3 switches Extreme is selling forward IPv6 in software. It's offensive they even bothered to implement any kind of software forwarding on a switch with twenty-four gigabit ports. My impression is that older switches have been just as brittle in this department: if the whole switch doesn't need replacing---which it probably will, if it's an old L3 edge switch---you may still need to replace a ``policy-feature card'' or a ``longest-match forwarding engine card'' or something of the sort. In Ciscoland, at least, the huge pricetag of a single-port FastEthernet HWIC or NM apparently does buy you something, because those old router line cards were software-upgradeable to IPv6, including support for things like hardware-switched tunneling that switches of any age usually have to do in software.

This means you may reconfigure everything to the new L3 world right out to the network edge to avoid these storm disasters, then find yourself painted into a corner when you want to roll out IPv6. I did. Well, I was already running IPv6 on my hilarious hub-and-Unix-router LAN before I installed any switches, so it didn't exactly sneak up and surprise me, but I'm still stuck. I deal with it by running a bunch of L3 VLAN's for IPv4---the best practice I'm describing, intended to avoid the big storms of the old days---but then one big VLAN for IPv6. Extreme lets you sort packets into VLAN's based on ethertype rather than 802.1q tag, so I grab just the IPv6 ethertype of 0x86dd and L2-switch it into a single big STP VLAN (there's a small sketch of this ethertype check after this table row). Cisco's older L3 switches have a similar feature called ``integrated routing and bridging,'' where they will L2-switch any ethertypes their L3 silicon doesn't recognize. It works, but either way you are stuck with a big STP domain. So what if there's no IPv4 traffic on it? Whether it carries lots of traffic or just a few stray packets, as long as it's there, you're at just as much risk of storms and huge outages.

The MAN storm problem is not solved until we have IPv6 support in silicon, in all L3 switches right out to the edge, and it seems to be much harder for these vendors to back-port v6 onto their old switches than onto their old routers. IPv6 in L3 switches is still single-vendor (Cisco), and brand new and thus probably unstable, as of 2007. So: not solved.
|
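To make the ethertype trick concrete, here's a minimal sketch of the per-frame decision an ethertype-based classifier makes. This is pure byte-poking, not any vendor's CLI; the VLAN names are made up.

```python
# Sketch: the per-frame decision behind ethertype-based VLAN classification.
# IPv6 (0x86dd) goes to one big L2-switched VLAN; everything else stays in the
# routed, per-edge IPv4 VLAN for that port.
import struct

ETHERTYPE_IPV6 = 0x86DD

def classify(frame: bytes, port_ipv4_vlan: str) -> str:
    (ethertype,) = struct.unpack("!H", frame[12:14])   # ethertype field of the header
    if ethertype == ETHERTYPE_IPV6:
        return "v6-big-l2-vlan"       # hypothetical name of the one big STP VLAN
    return port_ipv4_vlan             # the small routed VLAN for this edge port

# a made-up IPv6 frame header: dst MAC, src MAC, ethertype 0x86dd
hdr = bytes.fromhex("ffffffffffff" + "001122334455" + "86dd")
print(classify(hdr + b"\x00" * 40, "edge-vlan-12"))    # -> v6-big-l2-vlan
```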
Can't sniff |
Cisco SPAN, RSPAN, and ERSPAN. Extreme port mirroring (which is like SPAN). There's still no way to sniff all the traffic on a VLAN from one place, which is trivial with hubs and routers. One can sniff all the traffic on a VLAN on one specific switch, but the L2 architecture makes it a hard problem to sniff the entire switched domain from a single port: you have to configure monitoring separately through the management interface of each switch in the VLAN. L3 switches ease this pain quite a bit, because it's a good and practical idea to have no VLAN's that span more than one switch, and with these features it then becomes possible to monitor all the traffic in a particular VLAN through a single port.

These *SPAN features are full of nasty quirks and small static limits, but I don't think there's a better or safer way to combine the complex filter-and-store of a dedicated sniffer with a switched network. Also, there are sFlow and NetFlow. But it still kind of sucks. Extreme's SPAN-like feature only works on a single switch, so the sniffer has to plug into the same switch as the port being sniffed. That sucks a lot. Cisco's RSPAN and ERSPAN suck less, but they're still a little scary. (There's a small capture sketch after this table row.)
|
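On the box plugged into the mirror or RSPAN destination port, the capture side is at least simple. A minimal sketch, assuming Linux, Python, a hypothetical interface name, and that the kernel hasn't already stripped the 802.1Q tag via VLAN offload (on many NICs it will, which is its own little quirk):

```python
# Sketch: read frames off the port connected to a mirror/RSPAN session and
# report which VLAN each mirrored frame was tagged with.
import socket
import struct

ETH_P_ALL = 0x0003      # linux/if_ether.h
TPID_8021Q = 0x8100     # 802.1Q tag protocol identifier

s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
s.bind(("eth1", 0))     # hypothetical interface facing the monitor port

while True:
    frame = s.recv(65535)
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == TPID_8021Q:
        (tci,) = struct.unpack("!H", frame[14:16])
        print("mirrored frame from VLAN", tci & 0x0FFF)   # low 12 bits = VID
    else:
        print("untagged mirrored frame, ethertype 0x%04x" % ethertype)
```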
Can't use traceroute | Fixed! |
Can't handle MTU differences |
Fixed! L3 switches will route between VLAN's and generate ICMP too-big messages when packets cross into a VLAN with a smaller MTU. This is so great. Test and be careful, though. The MTU config on ExtremeWare is a little confusing---I made a cheatsheet. Sometimes there are config options for whether to generate the ICMP too-big messages or not. I remember reading this, but my switches are generating them, so I'm happy.

Also, the too-big message needs to include a little bit of the header of the packet that's too big. There are problems for everyone if this header is simply missing, and problems for IPsec if too little of the header is included. I bet L3 switches are more likely to get this wrong than real routers, but I haven't run into a problem there yet. (There's a small path-MTU-discovery sketch after this table row.)

Finally, setting the size of jumbo frames is important with L3 switches. With L2 switches, the switch only needed to handle ``big enough'' frames, and then you needed to make sure all the end systems on a given VLAN had the same MTU configured; there was no need for precise agreement between the end systems and the switch. Now, there is. My cheatsheet is fine for ExtremeWare 7.x, but ExtremeWare 4.x seems to be missing some knobs. Also, it seems like you're limited to one jumbo size---you can't have some VLAN's a little bigger or smaller on the same switch. Of course, for IPv6 you need an IPv6-capable L3 switch, which I don't have, so differing IPv6 MTU's still have to go through the Unix router. (XXX -- does NDP handle hosts with differing MTU's on a single broadcast domain or not?)
|
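A quick way to check that those too-big messages are really coming back is to let the host's own path-MTU discovery do the work. A minimal sketch, assuming Linux, a host on the jumbo-MTU VLAN, and a hypothetical destination on a 1500-byte VLAN; the option numbers are the Linux values, hard-coded in case this Python's socket module doesn't name them:

```python
# Sketch: verify PMTUD across a routed VLAN boundary by sending DF-marked
# datagrams and then asking the kernel what path MTU it has cached.
import socket

IP_MTU_DISCOVER = 10   # Linux setsockopt numbers (linux/in.h); assumption:
IP_PMTUDISC_DO  = 2    # your socket module may not export these by name
IP_MTU          = 14

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)  # set DF
s.connect(("192.0.2.1", 9))   # hypothetical host on the 1500-byte VLAN

try:
    # bigger than the far VLAN's 1500 MTU, smaller than our local jumbo MTU
    s.send(b"\x00" * 3000)
    s.send(b"\x00" * 3000)    # this one fails once the too-big message arrives
except OSError as e:
    print("send refused:", e)  # EMSGSIZE once the kernel learns the path MTU

print("cached path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
```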
Can't do QoS on IP TOS/DSCP or on TCP/UDP port number |
Fixed! Incredibly complicated classifiers and markers/policers are possible, although the schedulers are still extremely primitive. In particular, work-conserving link-sharing schedulers like HFSC are missing: there is a flat array of queues, not a hierarchy, and only two to eight queues are available. There's a lot of somewhat goofy encouragement to do QoS on ingress. But this is the DiffServ model. Yes, it's a disappointment compared to HFSC, but I think it has some potential. I need to look into this a little more carefully, though. Can I prioritize TCP ACK's with no data payload, or does that require a level of packet inspection that isn't implemented? Is the lack of a proper stateful firewall a real problem, or are the access lists they can implement in silicon good enough? (A small DSCP-marking sketch follows this table row.)
|
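Since the switches classify on DSCP, the host's side of the bargain is just to mark its packets. A minimal sketch in Python, assuming a Unix-ish stack and a hypothetical peer; EF (46) is used here purely as an example of a code point the switch might map to its priority queue:

```python
# Sketch: mark a socket's traffic with DSCP EF so an L3 switch's DiffServ
# classifier can drop it into a high-priority queue.
import socket

DSCP_EF = 46                      # "expedited forwarding" code point
TOS_EF = DSCP_EF << 2             # DSCP sits in the top 6 bits of the TOS byte

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_EF)
s.sendto(b"voice-ish payload", ("192.0.2.53", 4000))   # hypothetical peer
```

The switch-side match on EF is where each vendor's own ACL syntax comes in, which I won't guess at here.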
Can't do ECN |
At least it's theoretically fixable. My older Extreme switches do RED, but not ECN, which is the same as what an L2 switch can theoretically do. I haven't checked yet whether Extreme lets me do RED on one class of traffic and FIFO on another. That ability, combined with L3-aware classification, would be a useful improvement over L2 switches, because I could use a FIFO queue, or a RED queue with a large zero-drop-probability region, for non-TCP traffic. At least the Cisco 6500 is supposed to have abilities like that. Here is some background about FIFO, RED, and ECN. (A small sketch of where the ECN bits live is after this table row.)
|
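For reference, ECN and DSCP share the old TOS byte: DSCP gets the top six bits and ECN the bottom two, which is why an L3-aware classifier can even see these fields while an L2 switch can't. A minimal decoding sketch, nothing assumed beyond the RFC 2474/3168 field layout:

```python
# Sketch: split an IP TOS / traffic-class byte into its DSCP and ECN fields
# (RFC 2474 / RFC 3168 layout).
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}

def decode_tos(tos):
    dscp = tos >> 2          # top 6 bits: DiffServ code point
    ecn = tos & 0b11         # bottom 2 bits: ECN field
    return dscp, ECN_NAMES[ecn]

print(decode_tos(0xb8))      # (46, 'Not-ECT'): EF-marked, ECN not in use
print(decode_tos(0xbb))      # (46, 'CE'): same class, congestion experienced
```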
Networks are naturally one-to-many. We were promised this full mesh of non-blocking switching would speed things up tremendously, but in practice it doesn't get used. |
We need full-duplex links for at least two big reasons: to implement extremely low-jitter QoS, and because a reasonable collision window at 1Gbit/s is only a few tens of meters long (a back-of-the-envelope calculation follows this table row). These requirements justify making a switching mesh rather than a broadcast domain. However, at the network edge I think switchports are still really expensive, and backplane capacities still sneerably underfilled.

I did hear an interesting argument on this point, though. If the single server and each of a hundred clients have pipes of the same size, instead of giving clients NIC's one tenth the size of the server's, then yes, most of the clients will have underused network connections, but the server will tend to fill client requests more sequentially and less in parallel. This can reduce memory requirements and context switching in the server. I don't know---how real is this effect? Also, server software is under great pressure right now to serve parallel requests efficiently because of Internet web users, so, trend-wise, I don't know if there's room to settle into the benefits of a more serial workload, even if the speculated benefit really exists.
|
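The collision-window claim is just arithmetic: if the classic 512-bit slot time were kept at gigabit speed, the whole round trip would have to fit in 512 ns. A quick sketch of the numbers (the 2e8 m/s propagation figure for copper is an approximation):

```python
# Sketch: why half-duplex gigabit Ethernet gets a uselessly small collision
# domain if the 512-bit slot time of 10/100 Mbit Ethernet is kept.
SLOT_BITS = 512                 # classic minimum frame / slot time, in bits
BIT_RATE = 1_000_000_000        # 1 Gbit/s
PROPAGATION = 2e8               # roughly 2/3 c in copper, metres per second

slot_time = SLOT_BITS / BIT_RATE                 # 512 ns
round_trip_metres = slot_time * PROPAGATION      # ~102 m of cable, round trip
print("one-way cable budget: ~%.0f m" % (round_trip_metres / 2))

# That ~51 m is before PHY and repeater delays eat into the budget, which is
# why 802.3z had to extend the slot to 4096 bits (carrier extension) to keep
# a usable half-duplex diameter, and why nobody bothers: full duplex wins.
```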
Using switches at the edge and routers at the core is a marketing strategy to soak everyone for what they're willing to pay. From a technical standpoint, we should use switches in the core and routers near the edge. |
I'm slightly patting myself on the back for this, because in the old rant I almost seemed to describe MPLS: no routing table inside the core switches (label swapping, label distribution), edge routers ``cooperating'' to implement QoS on the core switches' output ports (MPLS-TE). Sort of. And this is exactly the architecture L3 switch vendors seem to imagine, with even fixed-configuration edge switches able to speak MPLS if you order them with the right options. They are quite ready to let you run a fixed-configuration switch with one VLAN per switchport, one switchport per apartment, one fixed-configuration switch per urban building, and all their fixed-configuration switches have at least two fiber ports for some kind of metro ring. It's this wonderful STP-free future! And it's here now, and the equipment's not too expensive, if you can afford to hire someone who knows how to configure it. (A toy sketch of the label-swapping idea follows this table row.)

But even where we use MPLS, so far we only use it within an AS. We have two kinds of core on the Internet today. The first is the cores of the Tier 1 ISP's, which I would expect are MPLS, because they are all flogging this ``managed WAN'' thing now. The second, the cores of Internet exchanges, I believe are some combination of ``doesn't exist---it's just a rat's nest of passive wiring connecting pairs of BGP-speaking routers'' (for extremely fast peerings) and IX-managed plain L2 switches with VLAN's (for most members of the IX). That is, in some sense, an L2 core with BGP next-hops acting as the labels.

Here's one question on which I don't know what call to make: will multicast ever cross AS boundaries? So far I think it's only used to deliver broadcast television or CCTV (``video conferencing,'' ha ha) within an AS. If it ever does, we'll need AS-crossing QoS/traffic-engineering extensions, too, because subscribing to a multicast feed will need to go through some RSVP-ish check to make sure no link is overwhelmed. Do you think? Or will interconnects always be faster than all the television that exists anywhere? Or will there be more primitive oversubscription-throttling mechanisms? Will there be ISP-level DDoS's caused by worms that deliberately make each customer subscribe to a different obscure Eastern European television feed? If we ever do have multicast, and thus TE, crossing AS boundaries, then I think MPLS will develop more complicated, BGP-ish control-plane protocols, and an MPLS core will replace the plain L2 core in IX's.
|
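To illustrate what ``no routing table inside the core'' buys, here is a toy sketch of the difference between a core label-switch hop (an exact-match swap on a small label table) and the edge router's one-time longest-prefix match. All the labels, prefixes, and port names are made up:

```python
# Toy sketch: a core LSR forwards on an exact-match label table, while the
# edge LSR does the longest-prefix match once and imposes the label.
import ipaddress

# per-core-switch label table: in-label -> (out-label, out-port)
CORE_LFIB = {17: (42, "port2"), 18: (99, "port3")}

# edge router FIB: prefix -> label to impose
EDGE_FIB = {ipaddress.ip_network("203.0.113.0/24"): 17,
            ipaddress.ip_network("198.51.100.0/25"): 18}

def edge_impose(dst):
    """Longest-prefix match, done once at the edge."""
    matches = [n for n in EDGE_FIB if ipaddress.ip_address(dst) in n]
    return EDGE_FIB[max(matches, key=lambda n: n.prefixlen)]

def core_swap(label):
    """Exact-match swap, cheap enough for dumb, fast core silicon."""
    return CORE_LFIB[label]

label = edge_impose("203.0.113.7")
print("edge imposes label", label, "-> core swaps to", core_swap(label))
```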
With hubs, packets aren't lost. During congestion, they wait inside each traffic-sourcing host for space on the wire. Also there are L1 retransmissions. |
I think typical hosts are still rather dumb: they will drop UDP packets at various spots inside the network stack, and they don't give applications the kind of API they need to manage queue lengths inside the kernel, rate-adapt, and achieve low latency. So applications always treat the network as lossy and try to probe and learn it, even when the same information is more deterministically available from the kernel. This could change, but... we've been waiting a while now. It's still a problem. (There's a small sketch of the closest thing to such an API after this table row.)

I think fibre channel and infiniband meshes don't drop packets, which means they can run SCTP, can talk to extremely wide RAID stripes without overloading buffers, and, like hubs, can push queues all the way back past the ingress border, where the traffic-originating device can manage the buffer. I've heard of gigabit ethernet ``flow control,'' but I don't know what it is, how or when it works, or where it's ever been used to make a working end-to-end lossless network. It doesn't really make sense as adequate if I'm right in assuming that all the gigabit receiver can say is ``stop sending.'' What needs to be said is far more complicated, more like: ``Frame blocked---your next turn to send me something for MAC <n> comes in 78 timeslices, so reoffer me this frame at that time. For now, reorder your transmit queue and offer me a different frame.''

Granted, a potentially high-latency L3 network can't be lossless like FC-SW or IB. The lossless strategy at L3 is to stop using UDP entirely and use TCP and ECN (or RSTP and ECN?), which, as discussed above, is possible now at least in theory. It's interesting that giving up on hubs means NFS now has to run over TCP, and that we're interested in non-Ethernet L1's again. And this low-latency cut-through issue hasn't gone away between my outdated L2 rant and modern L3 switches. Hubs really were special.
|
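The closest thing I know of to the missing API is letting the application watch its own socket send queue. A minimal sketch, assuming Linux and a hypothetical peer; the SIOCOUTQ ioctl number is hard-coded because it's a Linux-specific value Python's socket module doesn't export:

```python
# Sketch: watch how many bytes are still queued inside the kernel for a TCP
# socket, so an application could rate-adapt instead of just blasting away.
import array
import fcntl
import socket

SIOCOUTQ = 0x5411   # Linux ioctl number (same value as TIOCOUTQ); assumption:
                    # not portable, and not named in Python's socket module

def queued_bytes(sock):
    """Bytes still in the send queue: unsent plus sent-but-unacknowledged."""
    buf = array.array("i", [0])
    fcntl.ioctl(sock.fileno(), SIOCOUTQ, buf)
    return buf[0]

s = socket.create_connection(("192.0.2.80", 9))   # hypothetical discard server
s.sendall(b"\x00" * 65536)
print("bytes still queued in the kernel:", queued_bytes(s))
```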