>>>>> "il" == Isaac Levy <ike@xxx> writes:

il> Pawel Jakub Dawidek, "FreeBSD and ZFS".

I have been using it on Solaris for a little over a year, and it really
is ``that good''.  Some of the problems I blogged about last December
have been fixed between Nevada b44 and b71.  It's still not perfect,
though, and some of these problems will certainly spill into FreeBSD's
implementation:

 * I'm still having problems where the machine panics if a disk goes
   away.  Panicking on strange filesystem conditions (and, in some
   cases, I think kernel memory corruption if an on-disk data structure
   is garbage?) was the norm with FFS, but that norm needs to end.

 * I still don't understand the state machine for mirrors---if half a
   mirror goes away, then comes back, when will ZFS notice it's out of
   sync: right away, or only after a scrub?

   - The claim is that it notices right away, and yes, there is a
     mini-resilver that happens after the mirror is rejoined.  But if I
     do 'zpool scrub pool' after the mini-resilver finishes, the scrub
     still finds inconsistencies.

   - Errors reported by 'zpool status', including the
     mirror-inconsistency ``please scrub me by hand'' errors, tend to
     vanish after rebooting.  ZFS forgets that it noticed the mirror
     was inconsistent.  That doesn't seem okay.

   For things like iSCSI (restarting the daemon) or scratchy FireWire
   connections (targets go away and come back, at worst maybe even with
   a different device name), it's important to handle a mirror
   component that vanishes for, say, 2.5 seconds and then comes back in
   a solid and graceful way.

   The real message here, though, is an optimistic one: ZFS has given
   us an architecture and a style that make it possible to ask for
   something as ridiculous as ``please gracefully deal with targets
   that vanish for 2.5 seconds and re-appear on a different device
   node,'' which would be impossible with a regular LVM/geom/RAIDframe
   type system, or even for a hardware system without a gigabyte of
   NVRAM.

 * There is also a missing feature that would be very nice: an
   equivalent of LVM's 'pvmove' command, to migrate data off one vdev
   onto the other (possibly just-added) empty vdevs so that the whole
   old vdev can be safely removed from the pool.

But yes, ZFS is probably better than everything else, and there is so
much obvious and non-obvious stuff that's fantastic about it.  For
example, I think the idea of scrubbing is one of the non-obvious
fantastic things.  For near-line storage, it's common for disks to go
bad quietly---you don't know they're bad until you try to access the
seldom-read data, which is terrible because you end up with mirrors and
RAID5's that have multiple bad components.

So, in my opinion, disks in a mirror or RAID5 should be tested every
couple of months with 'dd if=/dev/disk of=/dev/null', or with some kind
of SMART self-test (offline testing or background testing?), something
that reads every block.  But this practice is not common, and even if
you suspect a drive is bad and fsck it, a full read is not done by a
modern fsck invoked in the normal way.  It hasn't been, since the
ancient days of disks that didn't remap bad blocks.  The practice needs
to come back---not necessarily reading unallocated areas of the disk,
but every block that's holding data should get a bimonthly test-read.
'zpool scrub' does this in (I think) an O(n) way, and running it
regularly is a common ZFS best-practice.
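(Concretely, a bimonthly test-read could look something like the
following.  The device and pool names are made up, and the smartctl
line assumes smartmontools is installed; paths and options vary by OS.)

   # read every block of one disk, discarding the data
   dd if=/dev/dsk/c3t1d0s3 of=/dev/null bs=1048576

   # or ask the drive itself to do a full surface read in the background
   smartctl -t long /dev/rdsk/c3t1d0

   # or, with ZFS, read and verify every allocated block in the pool
   zpool scrub pool
   zpool status -v pool    # check the result once the scrub finishes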
People always have strange, complicated, long stories about how they
lose their data, but my impression is that home users tend to lose
everything about once every one or two years, and experienced Unix
people maybe every five years or so?  I think something like scrubbing
a giant near-line array, versus not scrubbing it, can add many years to
its in-practice lifespan.  I haven't lost everything yet, but I do have
these habitual mini-disasters that need to stop.

I had an unmirrored single-disk ZFS pool go bad recently---the drive
was still working but had read and write errors.  I go through this
marginal-disk problem a lot, and the answer for me is usually:

   dd if=/dev/broken of=/dev/newdisk bs=512 conv=noerror,sync
   fsck /dev/newdisk

With ZFS, this becomes more like:

   dd if=/dev/broken of=/dev/newdisk bs=512 conv=noerror,sync
   zpool import                     [look for the pool's serial number]
   zpool import -f 73710598603223
   zpool scrub pool

There were two regions of read errors on the disk.  When ZFS's scrub
finished, 'zpool status' gave me the pathnames of the files that were
corrupted by dd's replacing the unreadable blocks with zeroes.  I
didn't need the files, so I deleted them, and now ZFS shows this:

   bash-3.00# zpool status -v pool
     pool: pool
    state: ONLINE
   status: One or more devices has experienced an error resulting in
           data corruption.  Applications may be affected.
   action: Restore the file in question if possible.  Otherwise restore
           the entire pool from backup.
      see: http://www.sun.com/msg/ZFS-8000-8A
    scrub: none requested
   config:

           NAME          STATE     READ WRITE CKSUM
           pool          ONLINE       0     0     0
             c3t1d0s3    ONLINE       0     0     0

   errors: Permanent errors have been detected in the following files:

           pool/export:<0x3a10e>
           pool/export:<0x226cb6>
           pool/export:<0x28cf7b>

(It used to show pathnames, I promise.)

With Linux ext3, I would discover errors like these a year later, when
some .avi wouldn't play.  Having pathnames is obviously great, because
I can go hunting all over the Internet or through my disk clutter for
another copy of the corrupt file, and I can do my hunting now, before
next year when other copies of the file have become more scarce.  I can
feel confident the rest of the disk is in good shape instead of
wondering whether I should reinstall operating systems.  This saves me
so much time and lets me be lazy.  It's good to have accurate,
sanely-displayed information about exactly which data was lost when
fsck'ing an unclean filesystem, rather than 'inode 98fed94 CLEARED!!!'.

FFS has been keeping my valuable data since 1999, and now, after a year
of testing ZFS, I think I will move this data onto ZFS for the next
eight years.
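(I think those leftover <0x...> entries only disappear after another
scrub completes with the bad files already deleted; I haven't pinned
down the exact sequence, but it is something like this:)

   zpool scrub pool
   zpool status -v pool    # re-check once the scrub finishes
   zpool clear pool        # also resets the per-device READ/WRITE/CKSUM counters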