It was a fantastic success! We started about twenty minutes late, then worked diligently until 20:20.
There was a slight correction in attendance, but the enthusiastic response to the web page remains universal.
I stayed well away from CAcert this time and instead worked on iSCSI.
Here is the motivation: I need a cheap >128GB hard disk for my Solaris/SPARC system.
The dad driver is closed-source as of 2006-12-17. (update 2007-04-03: this may be changing, because I saw a message on onnv-notify: ``6535388 warlock build fails for dad due to move from closed area''.) (update 2007-04-10: actually I may be wrong about this. It needs testing, on both x86 and SPARC, but I still think we need source for a basic driver like dad. ``core OS'' indeed.)
The SATA code sits in a common/ directory, which suggests it's platform-independent, but this is not correct: Solaris SATA is x86 only, so SATA on SPARC is impossible for now. No other problems matter.
So my final conclusion is: there's no reliable way for a cost-conscious person to attach storage to a Solaris/SPARC box at all, and there's no way to attach any storage to any Solaris box, SPARC or x86, with an open-source driver.
I have a ZFS mirror made from one PATA disk and one SATA disk, both on Firewire right now. The plan for tonight: I will leave the PATA on Firewire, and move the SATA into a Linux box (where the SATA driver actually works), and export it to Solaris using Linux's iSCSI Enterprise Target.
This was a piece of cake. I literally spent 30min getting IET to work, and 30min getting the Solaris initiator to work.
IET uses a single /etc/ietd.conf to configure everything, and the syntax in that file is quite simple. There are a few standards-bloat knobs, but not too many.
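For reference, a minimal ietd.conf looks something like this (the target name, backing device, and CHAP credentials here are made-up examples, not my actual config):

```
# /etc/ietd.conf -- one target, one LUN, one CHAP user (all names hypothetical)
Target iqn.2006-12.org.example:scratch.sata0
        IncomingUser iscsiuser secretsecret12
        Lun 0 Path=/dev/sdb,Type=fileio
```

That's the whole file for a single-disk target; restart ietd and the target is exported.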
The Solaris initiator follows the new convention of storing config in a binary uneditable file in /etc and forcing you to use a tool to edit it. The upside is that iscsiadm commands take effect immediately and survive reboots. As with Cisco, there is no difference between the command to change something and the command to make the change survive reboots. Unfortunately, it didn't work in practice. Perhaps because the box suffered an unclean shutdown (albeit minutes after I had done the iscsiadm configuration), the iSCSI passwords did not survive reboot. I wasn't able to reproduce the problem---it's persisted fine since then.
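For the record, the whole configuration amounts to a handful of iscsiadm commands, roughly like this (the address is made up, and you should check iscsiadm(1M) before trusting my flags):

```
# point the initiator at the target and enable sendtargets discovery
iscsiadm add discovery-address 10.100.100.135:3260
iscsiadm modify discovery --sendtargets enable
# set the initiator's CHAP secret (the tool prompts for it) and turn on CHAP
iscsiadm modify initiator-node --CHAP-secret
iscsiadm modify initiator-node --authentication CHAP
# verify the target shows up
iscsiadm list target
```

Each command takes effect immediately and is persisted to that opaque config file.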
However there is an annoying bug with the Solaris initiator. If I restart IET on Linux, the iSCSI session of course dies. Solaris automatically tries to reconnect, but gives this error:
WARNING: iscsi connection(n) login failed - can't accept MaxOutstandingR2T 0
AFAICT there is no clean way to make Solaris reconnect a second time. Shouldn't this manual intervention be unnecessary? Something like a one-second pause before reconnecting, a periodic reconnect attempt, or reconnect-on-attempted-device-access with a ten-second enforced quiet period between attempts would do. The Solaris devices shouldn't have to stay wedged until manual intervention after the Linux target reboots. [update 2007-12-04: this seems to work ok in Nevada b71.]
If I reboot the Solaris box, it works and connects. If I iscsiadm remove discovery-address 1.2.3.4 and then iscsiadm add discovery-address 1.2.3.4, that does make it reconnect a second time, and it works without rebooting or re-entering CHAP passwords (passwords apparently accumulate undeletably in that secret opaque config file). That's basically good enough, but what if I had two targets and only one was wedged? Removing the discovery-address would clear both targets' sessions. [update 2007-12-04: this is not fixed in b71. There should be a ``force-clear session'' feature, if for no other reason than to regression-test CHAP changes before a maintenance window ends. And there should be a corresponding ``force immediate connection attempt'' command, which would be useful given the dynamic /dev/dsk namespace, which means you can't always even ask for a forced connection by touching the device.] Anyway, not a huge deal, and easy enough to work around, since it doesn't stop a box from booting up.
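So the session-reset workaround, in full (with 1.2.3.4 standing in for the target's address):

```
# drop and re-add the discovery address; note this clears and re-establishes
# ALL sessions to targets found through it, not just the wedged one
iscsiadm remove discovery-address 1.2.3.4
iscsiadm add discovery-address 1.2.3.4
```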
However the two big problems with ZFS mirrors I discovered for NYCBSDCon bit me hard.
First, I added the iSCSI disk to the mirror with zpool attach pool c3t0d0s3 c2t1d0s3, but the machine froze one hour into the process, before it could finish resilvering. The mouse still moved, but all X applications were frozen, with no change in over 30 minutes. [update 2007-12-04: this is not happening with Nevada b71, but I switched to more powerful hardware, too.]
Second, after I rebooted, the partially-resilvered disk prevented the machine from booting, even into single-user mode. So adding a device to a zpool mirror can actually make your system less reliable, because if any problem happens to any of the mirror components, you can't boot! [update 2007-12-04: this crap still happens. ok boot -m milestone=none will get you a prompt, and from there you need to either make the device more decisively gone (remove the discovery-address, for iSCSI) or convince ZFS not to look for it any more (expect lots of ``no valid replicas'' obstinacy).] You can't get to your data! Without the mirror I only have to worry about one device not working, but with the mirror any of the components can stop the box from coming up. Pretty shabby.
Missing Firewire mirror components seem to be ok (maybe because of the way Firewire sort of saves a spot for a disconnected disk?), but missing IDE components and missing iSCSI components stop bootup. Combined with the disklabel and zpool import obstinacy, I feel much more exposed than I did with SVM. If these bugs are fixed, I think zpool mirrors will have major advantages over SVM mirrors, like being able to know which mirror is more correct without a metadb, but for now I feel exposed.
Again, zpool offline pool c2t1d0s3 does not work. It says ``no valid replicas.'' Only zpool status; zpool detach pool c2t1d0s3 rescued my system, and I had to do it right after the single-user password prompt came up. Wait five seconds and the system is wedged again. It also wedged if I didn't type the zpool status command first.
After pulling off this superfast zpool detach, I adventurously re-added the iSCSI mirror component, this time in single-user mode instead of on a booted system running X, and it worked---after about 7 hours I have a working resilvered system with a synchronized mirror, half over iSCSI and half over Firewire. And it consistently boots up just fine. I'm so relieved, and plan to move on---I have to get net-install-server working before the installfest.
Not sure why resilvering didn't work in X. Maybe it was some kind of coincidence involving the problem with the iSCSI initiator wedging. I know from experience with the two Firewire cases that if I unplug a case during resilvering, ZFS can wedge the system, while if I unplug the case while the pool is online and in sync ZFS does fine. I don't think I restarted IET during the first resilver attempt, but I could have bumped a cable or something. Since resilvering takes a day and risks making the machine not boot, I didn't repeat the resilver-with-X11-running experiment. Maybe later when I have enough hardware for an experimentation-only system.
BTW, good luck getting the Linux iSCSI initiator to work. Three different confusingly-named projects. Kernel patches. Ambiguous project-version-to-kernel-version mapping. Lots of problems reported on the mailing lists. Versions talked about but missing from the download site. Under Linux, the IET target is easy, but the initiator is hard.
update: today (2006-12-09) my Linux IET box crashed. It crashed in the typical Linux way, frozen solid with a black screen, and didn't even bother to reboot itself---I think turning off the console screensaver should be a Best-Practice for Linux ``servers'' because they always pull this bullshit. This is possible but not simple:
# with this, kernel messages go to the console. the normal running setting
# is '1', and messages go only to syslog. the first number is the console
# loglevel: 8 means everything up to KERN_DEBUG gets printed.
echo 8 > /proc/sys/kernel/printk
# the linux console blanker doesn't turn off during a panic, so disable it.
TERM=linux setterm -powerdown 0 < /dev/tty1 > /dev/tty1
TERM=linux setterm -blank 0 < /dev/tty1 > /dev/tty1
But the sucky part: instantly after the Linux iSCSI target crashed, my Solaris box panicked! So presumably the same thing would happen if I unplugged the network cable, or someone kicked the cord on the hub, or the network was DDoSed, or whatever. The pool containing /usr should have only been DEGRADED, because just one component of the two-disk mirror went away; the other component was still there on Firewire.
Solaris automatically rebooted, then hung before printing the hostname, continuously printing errors about being unable to reconnect to the iSCSI target. I tried 'boot -s', but that didn't work either; it froze before printing the sulogin prompt. I tried to boot off DVD, but it took like 15 minutes to probe devices with minimal CD-ROM activity, so I thought it was hung just like the regular 'boot -s' (it turns out it probably wasn't, just slow). What finally did work:
(/sbin/zpool does not work here because a few libraries are in /usr/lib instead of /lib. After I finished these steps, I copied those libraries to /lib, so now I can run /sbin/zpool without /usr mounted.)

1. /sbin/mount -o ro /usr
2. zpool status, and wait for the resilver to complete. Note it is not actually resilvering anything---there is only one ONLINE component in the mirror right now. But if you try to do anything to the pool, it will say no valid replicas until you let this fake-resilver complete. It only took about three minutes, hundreds of times faster than the resilvering that happens when attaching a blank disk to the mirror.
3. zpool detach pool <iSCSI target>. (zpool online scratch c2t1d0s4 doesn't work, which is expected, as it's not meant for FAULTED pools---a FAULTED pool is supposed to keep trying to reopen the device.)
4. zpool export scratch
5. zpool import -f scratch
I learned there is an alternative to 'boot -s' if booting hangs: boot -m milestone=none. Some Sun document suggests booting this way and removing /etc/zfs/zpool.cache if ZFS is preventing the system from starting up. Then all the pools have to be zpool import -f'ed to rebuild zpool.cache. Hopefully I can now run zpool without /usr mounted, because /usr is inside one of those pools that will need importing. I haven't tried it yet.
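As I understand that document, the untried procedure would go roughly like this (pool name 'scratch' is just my example; do this from the milestone=none prompt):

```
ok boot -m milestone=none
# make ZFS forget every pool it was going to open at boot
rm /etc/zfs/zpool.cache
# then re-import each pool by hand to rebuild the cache
zpool import -f scratch
```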
I have problems because so much depends on this Solaris box. I do my web browsing from it. I have two more web browsers, on an iMac and an Ultra 5, but both NFS-boot off this same box with the panicking ZFS. That's kinda cool actually, but I think I should buy some more computers so that I can read manuals while trying to repair things, instead of guessing my way through and then going back to find all the manuals to write this work-log after the work is finished.
I'm going to leave the iSCSI plex unattached until I can upgrade to a more recent SXCE, and see if the ZFS boot hangs and panics get any better with the new version. I doubt they will---on the mailing list the ZFS guys are stonewalling, saying they won't fix any problems of this class until they finish their ZFS-FMA integration project. Apparently a substantial promised piece of ZFS still is not written. A poster on the list said they wrap all their ZFS devices in single-plex SVM virtual partitions, and this fixes some of the ZFS hangs. I will do that, at least for the iSCSI component. Once I do, I will try unplugging the network and killing Linux ietd on purpose and such.
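A sketch of the single-plex SVM wrapping, using my device names and assuming an SVM state database already exists (if not, metadb -a -f <slice> creates one first):

```
# one-stripe, one-slice SVM volume wrapping the iSCSI disk
metainit d10 1 1 c2t1d0s3
# attach the SVM volume, not the raw iSCSI device, to the ZFS mirror
zpool attach pool c3t0d0s3 /dev/md/dsk/d10
```

The idea is that SVM's error handling sits between ZFS and the flaky device.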
Clearly some sysadmin education is helping here: basic Solaris stuff like the new -m milestone=none and knowing to wait 15 minutes for a boot CD to probe hardware, and ZFS-specific stuff like knowing when an import/export dance will fix ZFS's silent obstinacy. The ZFS architecture, concepts, and style look pretty good, even amazing. But (a) there are some serious fault-management problems with major availability implications that seem to me totally inappropriate for a stable release, and (b) sometimes too many different anomalous/error conditions are captured under one term in the tools---in this adventure, ``cannot open'', ``resilvering'', and ``no valid replicas''.
As for Linux, I don't care. I already know it's flakey, which is the whole reason it's supposed to be relegated to iSCSI targets hidden behind this ZFS redundancy layer. Eventually I want four-disk raidZ stripes attached from four separate Linux PeeCees, so if one crashes I can keep right on going: I can lose a whole tower. Right now, Solaris apparently can't handle losing an iSCSI target without panicking. Seems ridiculous to me, but it did just happen.
update: 2007-04-03
With the iSCSI device wrapped inside a single-member SVM stripe before it's added to the ZFS mirror, the Solaris box does come up while the Linux iSCSI target is down/unreachable. svc.startd eventually transitions metasync to maintenance. However, the box comes up very slowly, tens of times slower than usual. Less methodical sysadmins might have just called it broken. And I don't know what effect this has on other SVM volumes. metastat says:
metastat: amber: system/metasync:default: service(s) not online in SMF
Definitely not ideal. Also, zpool status takes forEVer. I found out why:
ezln:~$ sudo tcpdump -n -p -i tlp2 host fishstick
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tlp2, link-type EN10MB (Ethernet), capture size 96 bytes
17:44:43.916373 IP 10.100.100.140.42569 > 10.100.100.135.3260: S 582435419:582435419(0) win 49640
17:44:43.916679 IP 10.100.100.135.3260 > 10.100.100.140.42569: R 0:0(0) ack 582435420 win 0
17:44:52.611549 IP 10.100.100.140.48474 > 10.100.100.135.3260: S 584903933:584903933(0) win 49640
17:44:52.611858 IP 10.100.100.135.3260 > 10.100.100.140.48474: R 0:0(0) ack 584903934 win 0
17:44:58.766525 IP 10.100.100.140.58767 > 10.100.100.135.3260: S 586435093:586435093(0) win 49640
17:44:58.766831 IP 10.100.100.135.3260 > 10.100.100.140.58767: R 0:0(0) ack 586435094 win 0
10.100.100.135 is the iSCSI target. When it's down, connect() from the Solaris initiator takes a while to time out. I added its address as an alias on some other box's interface, so Solaris would get a TCP reset immediately. Now zpool status is fast again, and every time I type zpool status, I get one of those SYN/RST pairs. (One, not three---I typed zpool status three times.) They also appear on their own over time.
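The alias trick is just this, run on some always-up box on the same subnet (Solaris ifconfig syntax shown; the interface name is my example):

```
# answer ARP for the dead target's address, so the initiator's SYNs to
# port 3260 get an immediate RST instead of waiting for a connect() timeout
ifconfig tlp2 addif 10.100.100.135 netmask 255.255.255.0 up
```

Nothing listens on 3260 there, so the kernel RSTs each connection attempt instantly.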
How would I fix this? I'd have iSCSI keep track of whether each target is ``up'' or ``down''. If an upper layer tries to access a target that's ``down'', iSCSI immediately returns an error, then tries to open the target in the background. There would be no other automatic attempts to open targets in the background, so if an iSCSI target goes away and then comes back, your software may need to touch the device inode twice before you see the target available again.
If targets close their TCP circuits on inactivity or go into power-save or some such flakey nonsense, we're still ok, because after that happens iSCSI will still have the target marked ``up.'' It will thus keep the upper layers waiting for one connection attempt, returning no error if the first connection attempt succeeds. If it doesn't, the iSCSI initiator will then mark the target ``down'' and start returning errors immediately.
As I said before, error handling is the most important part of any RAID implementation. In this case, among the more obvious and immediately inconvenient problems we have a fundamentally serious one: iSCSI's failure to return errors quickly pushes us up against a timeout in the svc subsystem, so one broken disk can potentially cascade into breaking a huge swath of the SVM subsystem.
the moral is:
Here are some more disorganized ZFS/iSCSI notes. They aren't really appropriate for sharing with other people; the point was to record what I found with Nevada b71 and how to work around the many, many problems, but I don't have the interest to edit them now.