One Night of Work -- carton 2006-11-19

It was a fantastic success! We started about twenty minutes late, then worked diligently until 20:20.

There was a slight correction in attendance, but the enthusiastic response to the web page remains ubiquitous.

I stayed well away from CACert this time and instead worked on iSCSI.

Here is the motivation: I need a cheap >128GB hard disk for my Solaris/SPARC system.

So, my final conclusion is: there's no reliable way for a cost-conscious person to attach storage to a Solaris/SPARC box at all. And there's no way to attach any storage to any Solaris box, SPARC or x86, with an open-source driver.

I have a ZFS mirror made from one PATA disk and one SATA disk, both on Firewire right now. The plan for tonight: I will leave the PATA on Firewire, and move the SATA into a Linux box (where the SATA driver actually works), and export it to Solaris using Linux's iSCSI Enterprise Target.

This was a piece of cake. I literally spent 30min getting IET to work, and 30min getting the Solaris initiator to work.

IET uses a single /etc/ietd.conf to configure everything, and the syntax in that file is quite simple. There are a few standards-bloat knobs, but not too many.
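
For reference, a complete target definition in /etc/ietd.conf looks something like this; the IQN, device path, and CHAP user/secret below are made-up placeholders, not my actual config:

Target iqn.2006-11.net.example:scratch.disk1
        # export one block device as LUN 0
        Lun 0 Path=/dev/sdb,Type=fileio
        # CHAP user/secret the initiator must present to log in
        IncomingUser someuser secretsecret12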

The Solaris initiator follows Sun's new convention of storing config in a binary, uneditable file in /etc and forcing you to use a tool to edit it. The upside is that iscsiadm commands take effect immediately and survive reboots; as with Cisco, there is no difference between the command to change something and the command to make the change survive reboots. Unfortunately, it didn't work in practice. Perhaps because the box suffered an unclean shutdown (albeit MINUTES after I had done the iscsiadm configuration), the iSCSI passwords did not survive that reboot. I wasn't able to reproduce the problem---the configuration has persisted fine since then.
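
For reference, the initiator-side setup boils down to a handful of iscsiadm commands, roughly like this (the target address and CHAP name are placeholders):

# point the initiator at the target's portal and turn on SendTargets discovery
iscsiadm add discovery-address 1.2.3.4:3260
iscsiadm modify discovery --sendtargets enable

# CHAP; the secret is prompted for interactively and lands in the opaque config file
iscsiadm modify initiator-node --authentication CHAP
iscsiadm modify initiator-node --CHAP-name someuser
iscsiadm modify initiator-node --CHAP-secret

# make the new LUNs show up under /dev/dsk
devfsadm -i iscsi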

However, there is an annoying bug in the Solaris initiator. If I restart IET on Linux, the iSCSI session of course dies. Solaris automatically tries to reconnect, but gives this error:

WARNING: iscsi connection(n) login failed - can't accept MaxOutstandingR2T 0

AFAICT there is no clean way to make Solaris reconnect a second time. This manual intervention shouldn't be necessary: a 1sec pause before reconnecting, a periodic reconnect attempt, or a reconnect-on-attempted-device-access with a 10sec enforced quiet period between attempts would all be fine. The Solaris devices shouldn't have to stay wedged until manual intervention after rebooting the Linux target. [update 2007-12-04: this seems to work ok in Nevada b71.]

If I reboot the Solaris box, it works and connects. Running iscsiadm remove discovery-address 1.2.3.4 and then iscsiadm add discovery-address 1.2.3.4 also makes it reconnect, without rebooting and without re-entering CHAP passwords (passwords apparently accumulate, undeletably, in that secret opaque config file). That is basically good enough, but what if I had two targets and only one was wedged? Removing the discovery-address would clear both targets' sessions. [update 2007-12-04: this is not fixed in b71. There should be a ``force-clear session'' feature, if for no other reason than to regression-test CHAP changes before a maintenance window ends. And there should be a corresponding ``force immediate connection attempt'' command, which would be good given the dynamic /dev/dsk namespace, which means you can't always even ask for a forced connection by touching the device.] Anyway, not a huge deal, and easy enough to work around since it doesn't stop a box from booting up.
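
In cookbook form, the per-wedge workaround is just:

# drop and re-add the discovery address to force a fresh login attempt
iscsiadm remove discovery-address 1.2.3.4
iscsiadm add discovery-address 1.2.3.4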

However the two big problems with ZFS mirrors I discovered for NYCBSDCon bit me hard.

First, I added the iSCSI disk to the mirror with zpool attach pool c3t0d0s3 c2t1d0s3, but the machine froze one hour into the process, before it could finish resilvering. The mouse still moved, but all X applications were frozen, with no change in >30min. [update 2007-12-04: this is not happening with Nevada b71, but I switched to more powerful hardware, too.]

Second, after I rebooted, the partially-resilvered disk prevented the machine from booting, even into single-user mode. So adding a device to a zpool mirror can actually make your system less reliable: if any problem happens to any component of the mirror, you can't boot! [update 2007-12-04: this crap still happens. ok boot -m milestone=none will get you a prompt, and from there you need to either make the device more decisively gone (remove discovery-address for iSCSI), or convince ZFS not to look for it any more (expect lots of ``no valid replicas'' obstinacy).] You can't get to your data! Without the mirror, I only have to worry about one device not working, but with the mirror any of the components can stop the box from coming up. Pretty shabby. Missing Firewire mirror components seem to be ok (maybe because of the way Firewire sort of saves-a-spot for a disconnected disk?), but missing IDE components and missing iSCSI components stop bootup. Combined with the disklabel and zpool import obstinacy, I feel much more exposed than I did with SVM. If these bugs are fixed, I think zpool mirrors will have major advantages over SVM mirrors, like being able to know which mirror component is more correct without a metadb, but for now I feel exposed.

Again, zpool offline pool c2t1d0s3 does not work; it says ``no valid replicas.'' Only the sequence zpool status; zpool detach pool c2t1d0s3 rescued my system. I had to do it right after the single-user password prompt came up. Wait 5 sec and the system is wedged again. It also wedged if I didn't type the zpool status command first.
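
So the rescue, typed as fast as possible right at the single-user root prompt, is:

# the zpool status first seems to be required, and you get about 5 sec to finish
zpool status
zpool detach pool c2t1d0s3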

After pulling off this superfast zpool detach, I adventurously re-added the iSCSI mirror component, this time in single-user mode instead of on a booted system running X, and it worked---after about 7 hours I had a working, resilvered system with a synchronized mirror, half over iSCSI and half over Firewire. And it consistently boots up just fine. I'm so relieved, and plan to move on---I have to get the net-install server working before the installfest.

Not sure why resilvering didn't work in X. Maybe it was some kind of coincidence involving the problem with the iSCSI initiator wedging. I know from experience with the two Firewire cases that if I unplug a case during resilvering, ZFS can wedge the system, while if I unplug the case while the pool is online and in sync ZFS does fine. I don't think I restarted IET during the first resilver attempt, but I could have bumped a cable or something. Since resilvering takes a day and risks making the machine not boot, I didn't repeat the resilver-with-X11-running experiment. Maybe later when I have enough hardware for an experimentation-only system.

BTW, good luck getting the Linux iSCSI initiator to work: three different confusingly-named projects, kernel patches, ambiguous project-version-to-kernel-version mapping, lots of problems reported on the mailing lists, and versions talked about but missing from the download site. Under Linux, the IET target is easy, but the initiator is hard.

update: today (2006-12-09) my Linux IET box crashed. It crashed in the typical Linux way, frozen solid with a black screen, and didn't even bother to reboot itself---I think turning off the console screensaver should be a Best-Practice for Linux ``servers'' because they always pull this bullshit. This is possible but not simple:

# with this, kernel messages (including panic output) go to the console instead
# of only to syslog.  the first number in /proc/sys/kernel/printk is the console
# loglevel; 8 is high enough to let everything through, even debug messages.
echo 8 > /proc/sys/kernel/printk

# linux screensaver doesn't turn off during panic, so don't use it.
TERM=linux setterm -powerdown 0 < /dev/tty1 > /dev/tty1
TERM=linux setterm -blank 0 < /dev/tty1 > /dev/tty1

But the sucky part: instantly after the Linux iSCSI target crashed, my Solaris box panicked! So presumably the same thing would happen if I unplugged the network cable, or someone kicked the cord on the hub, or the network was DDoSed or whatever. The pool containing /usr should have only been DEGRADED, because one component of the two-disk mirror went away; the other component was still there on Firewire.

Solaris automatically rebooted, then hung before printing the hostname, continuously printing errors about being unable to reconnect to the iSCSI target. I tried 'boot -s', but that didn't work either; it froze before printing the sulogin prompt. I tried to boot off DVD, but it took like 15 minutes to probe devices with minimal CD-ROM activity, so I thought it was hung just like the regular 'boot -s' (it turns out it probably wasn't, just slow). What finally did work:

  1. Unplug the Firewire half of the mirror so only the iSCSI component is available.
  2. Boot. For some reason it doesn't like the iSCSI component, so it says the pool containing /usr is FAULTED, and drops straight to a root prompt without /usr mounted. Not sure why. Does it know from zpool.cache that the iSCSI half of the mirror is ``less recent''? When I had two Firewire mirror components, my impression was that ZFS could not tell which component was stale.
  3. /sbin/zpool does not work here because a few libraries are in /usr/lib instead of /lib (see the sketch after this list). After I finished these steps, I copied the libraries to /lib, and now I can run /sbin/zpool without /usr mounted.
  4. Plug the Firewire disk back in. Now the pool should be DEGRADED, though I can't see this because I can't run zpool.
  5. /sbin/mount -o ro /usr
  6. zpool status and wait for the resilver to complete. Note it is not actually resilvering anything---there is only one ONLINE component to the mirror right now. But if you try to do anything to the pool, it will say no valid replicas until you let this fake-resilver complete. It only took about three minutes, hundreds of times faster than the resilvering that happens when attaching a blank disk to the mirror.
  7. zpool detach pool <iSCSI target>
  8. reboot
  9. Now I found the 'scratch' pool is still FAULTED. It contains a filesystem mounted on /export/dropbox/rawdvd, and it's composed of one iSCSI component only. At this point in the recovery saga, the iSCSI component is back---I can read from it with 'dd' and see it in 'format -e'. And the 'pool' pool containing /usr is back, without iSCSI. But the 'scratch' pool isn't back. zpool online scratch c2t1d0s4 doesn't work, which is expected since it's not meant for FAULTED pools---a FAULTED pool is supposed to keep trying to reopen the device on its own.
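
About step 3: the general recipe for finding the offending libraries is to ask the runtime linker and copy whatever it lists from /usr/lib into /lib. The library name below is only an example of the sort of thing that turns up:

# show which of zpool's libraries live under /usr/lib
ldd /sbin/zpool | grep /usr/lib

# copy each one it lists, e.g.:
cp -p /usr/lib/libzfs.so.1 /lib/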

I learned there is an alternative to 'boot -s' if booting hangs:

boot -m milestone=none

Some Sun document suggests booting this way and removing /etc/zfs/zpool.cache if ZFS is preventing the system from starting up. Then, all the pools have to be zpool import -f'ed to rebuild zpool.cache. Hopefully I can run zpool without /usr mounted now, because /usr is inside one of those pools that will need importing. I haven't tried it yet.
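
Sketched out, it would go something like this, using my two pool names (untested, obviously):

# at the ok prompt:
#   boot -m milestone=none
# make ZFS forget which pools it was supposed to open at boot:
rm /etc/zfs/zpool.cache

# re-import each pool, which also rebuilds zpool.cache, then let boot continue:
zpool import -f pool
zpool import -f scratch
svcadm milestone all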

I have problems because so much depends on this Solaris box. I do my web browsing from it. I have two more web browsers, on an iMac and an Ultra 5, but both are NFS-booting off this same box with the panicking ZFS. That's kinda cool actually, but I think I should buy some more computers, so that I can read manuals while trying to repair things instead of guessing my way through and then going back to find all the manuals to write this work-log after the work is finished.

I'm going to leave the iSCSI plex unattached until I can upgrade to a more recent SXCE, and see if the ZFS boot hangs and panics get any better with the new version. I doubt it will---on the mailing list the ZFS guys are stonewalling, saying they won't fix any problems of this class until they finish their ZFS-FMA integration project. Apparently a substantial promised piece of ZFS still is not written. A poster on the list said they wrap all their ZFS devices in single-plex SVM virtual partitions, and this fixes some of the ZFS hangs. I will do that at least to the iSCSI component. Once I do this, I will try unplugging the network and killing Linux ietd on purpose and stuff.
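
The wrapping itself should just be something like the following; d10 is a made-up metadevice name, the slices are the ones from earlier, and it assumes SVM state-database replicas (metadb) already exist:

# one-way concat/stripe containing only the iSCSI slice
metainit d10 1 1 c2t1d0s3

# hand ZFS the metadevice instead of the raw iSCSI device
zpool attach pool c3t0d0s3 /dev/md/dsk/d10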

Clearly some sysadmin education is helping here: basic Solaris stuff like the new -m milestone=none and knowing to wait 15min for a boot CD to probe hardware, and ZFS-specific stuff like knowing when an import/export dance will fix ZFS's silent obstinacy. It looks like the ZFS architecture, concepts, and style are pretty good, even amazing. But (a) there are some serious fault-management problems with major availability implications that seem to me totally inappropriate for a stable release, and (b) sometimes too many different anomalous/error conditions are lumped under one term in the tools: in this adventure, ``cannot open'', ``resilvering'', and ``no valid replicas''.

As for Linux, I don't care---I already know it's flakey, which is the whole reason it's supposed to be relegated to iSCSI targets hidden behind this ZFS redundancy layer. Eventually I want to have four-disk raidZ stripes attached from four separate Linux PeeCees, so if one crashes I can keep right on going. I can lose a whole tower. Right now, Solaris apparently can't handle losing an iSCSI target without panicking. Seems ridiculous to me, but it did just happen.

update: 2007-04-03

With the iSCSI device wrapped inside a single-member SVM stripe before it's added to the ZFS mirror, the Solaris box does come up while the Linux iSCSI target is down/unreachable. svc.startd eventually transitions metasync to maintenance. However, it comes up very slowly, tens of times longer than usual. Less methodical sysadmins might have just called it broken. And I don't know what effect this has on other SVM volumes. metastat says:

metastat: amber: 
        system/metasync:default: service(s) not online in SMF

Definitely not ideal. Also, zpool status takes forEVer. I found out why:

ezln:~$ sudo tcpdump -n -p -i tlp2 host fishstick
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tlp2, link-type EN10MB (Ethernet), capture size 96 bytes
17:44:43.916373 IP 10.100.100.140.42569 > 10.100.100.135.3260: S 582435419:582435419(0) win 49640 
17:44:43.916679 IP 10.100.100.135.3260 > 10.100.100.140.42569: R 0:0(0) ack 582435420 win 0
17:44:52.611549 IP 10.100.100.140.48474 > 10.100.100.135.3260: S 584903933:584903933(0) win 49640 
17:44:52.611858 IP 10.100.100.135.3260 > 10.100.100.140.48474: R 0:0(0) ack 584903934 win 0
17:44:58.766525 IP 10.100.100.140.58767 > 10.100.100.135.3260: S 586435093:586435093(0) win 49640 
17:44:58.766831 IP 10.100.100.135.3260 > 10.100.100.140.58767: R 0:0(0) ack 586435094 win 0

10.100.100.135 is the iSCSI target. When it's down, connect() from the Solaris initiator takes a while to time out. I added its address as an alias on some other box's interface, so Solaris would get a TCP reset immediately. Now zpool status is fast again, and every time I type zpool status, I get one of those SYN, RST pairs (one pair per invocation, not three; I typed zpool status three times to produce the capture above). They also appear on their own over time.
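
For the record, on a BSD-ish stand-in box the alias command is something like this (fxp0 is a placeholder interface name; Solaris or Linux syntax would differ):

# claim the dead target's address so the initiator's SYNs get an immediate RST
ifconfig fxp0 inet 10.100.100.135 netmask 255.255.255.255 alias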

How would I fix this? I'd have iSCSI keep track of whether targets are ``up'' or ``down''. If an upper layer tries to access a target that's ``down'', iSCSI immediately returns an error, and then tries to open the target in the background. There are no unprompted background attempts to open targets; the only attempts are the ones triggered by access. So, if an iSCSI target goes away and then comes back, your software may need to touch the device inode twice before it sees the target available again.

If targets close their TCP circuits on inactivity or go into power-save or some such flakey nonsense, we're still ok, because after that happens iSCSI will still have the target marked ``up.'' It will thus keep the upper layers waiting for one connection attempt, returning no error if the first connection attempt succeeds. If it doesn't, the iSCSI initiator will then mark the target ``down'' and start returning errors immediately.

As I said before, error handling is the most important part of any RAID implementation. In this case, among the more obvious and immediately inconvenient problems there is a fundamentally serious one: iSCSI not returning errors fast enough pushes us up against a timeout in the SMF (svc) subsystem, so one broken disk can potentially cascade into breaking a huge swath of the SVM subsystem.

the moral is:

Here are some more disorganized ZFS/iSCSI notes. These aren't really appropriate for sharing with other people; the point was to record what I found with Nevada b71 and how to work around the many, many problems, and I don't feel like editing them now.

