Disaster recovery: Linux's ext2fs vs. BSD's FFS and LFS

Q. What else could cause corruption, besides disk blocks going bad?

There were few bad blocks. Subjectively, it seems to me that more metadata corruption happened than one would expect given the amount of metadata, the number of bad blocks, and the probability of the two overlapping.

We discovered the disk was going bad because the Custodian of Audrey reported that audrey crashed hard, refused to acknowledge reboot requests, would not even unblank the console, and so on. When a block goes bad, the small computer on the disk answers the big computer's request to read that block in a very simple, sane, consistent way: ``Block x is illegible. I've given up after trying to read it 89 times.'' Or, ``Block x is marginal. I recovered it with error correction after 8 tries and rewrote it.'' If you read a bad disk with NetBSD, the kernel forwards these diagnostics to the console in a human-readable way and logs them onto the (marginal) disk via syslog, so I can read them and know the disk is going bad.
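If you want that per-block map for yourself, here is a minimal C sketch (my own illustration, not part of the original recovery) that scans a raw disk device and prints which blocks the drive gives up on. The device path and the 512-byte block size are assumptions; adjust them for your disk:

    /* Scan a raw disk device block-by-block, logging blocks the
     * drive refuses to read.  Unlike a reset loop, this recovers
     * the information from every block that is still readable. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLKSIZE 512     /* assumed sector size */

    int main(int argc, char **argv)
    {
        char buf[BLKSIZE];
        off_t blk;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s /dev/rsd0d\n", argv[0]);
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) == -1) {
            perror(argv[1]);
            return 1;
        }
        for (blk = 0; ; blk++) {
            ssize_t n = pread(fd, buf, BLKSIZE, blk * BLKSIZE);
            if (n == 0)
                break;                  /* clean end of disk */
            if (n == -1) {
                if (errno != EIO)
                    break;              /* something other than a bad block */
                printf("block %lld: %s\n", (long long)blk, strerror(errno));
            }
        }
        close(fd);
        return 0;
    }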

If you read a bad disk with Linux, the first time the disk generates an error message of any kind (probably even ``Block x is marginal,'' which isn't a proper error at all, since the disk _was_ able to read the block), Linux assumes the disk has completely flipped out, and enters a busywait SCSI-bus-reset loop. It tells the disk's computer, ``Well, if you're complaining, then reset yourself, now.'' This takes 5-15 seconds per iteration, accomplishes nothing, and eventually crashes the kernel.

When you change a file in a filesystem, the change isn't recorded to disk immediately. This speeds things up: chances are, for example, that you're going to change the same disk block again within a few milliseconds. Even when everyone quits changing stuff, the physical disk doesn't catch up with the filesystem hallucination for at least a full _minute_.
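If a program needs the disk to catch up _now_, it has to ask. Here is a minimal C sketch of the standard escape hatch, fsync(2), which blocks until a particular file's dirty buffers reach the disk; it is my illustration, not anything audrey was running, and the filename is invented:

    /* Minimal fsync(2) sketch.  Without the fsync call, the written
     * bytes sit in the buffer cache at the kernel's pleasure; with
     * it, the call does not return until the data is on the platter,
     * or an error (like EIO from a bad block) comes back where the
     * application can actually see it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("mailbox", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        if (write(fd, "expunged folder contents\n", 25) == -1)
            perror("write");    /* so far, only in the buffer cache */
        if (fsync(fd) == -1)
            perror("fsync");    /* now on disk, or the error surfaces here */
        close(fd);
        return 0;
    }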

To see what happens with Linux, suppose there's just _one_ bad block on the disk. You read your email, delete some messages, eXpunge the folder. Linux writes part of your change to the disk, and keeps part of it in memory, planning to write it later. Meanwhile, someone fingers somebody, and a line gets written to the incoming-finger-requests log. Unfortunately, the single bad block gets allocated to this log file. The kernel gets a diagnostic error from the disk while writing the finger log, enters an endless disk-RESET loop, and crashes. Eventually someone reboots audrey by pulling the plug. The rest of the changes to your email folder are never written, because Linux crashed itself while obsessively trying to write that one block to one noncritical log file, at any cost, even though the disk had unambiguously reported there was no hope of success.

Although none of it was ever written to a bad block, your email folder, with its half-written metadata, is completely nonsensical upon reboot, and the filesystem cleaner (fsck) removes it entirely.

For a moment, forget Linux's abysmally poor handling of SCSI errors, without which this problem would never have occurred: the kernel would never have crashed, the plug never been pulled, and any data _not_ targeted for a bad block, like your folder's metadata, would eventually have been written.

Although Linux's filesystem design makes writing to it very fast, the designers of BSD's filesystem felt this metadata caching was an unacceptably precarious situation. By elevating operating system design to a Religion, their priesthood was able to design a filesystem they could prove would meet minimum standards of consistency even if you pull the plug at any moment you like. Metadata is cached less aggressively than regular data, and is written to the disk in a carefully-controlled sequence. Corruption is thus contained, at the expense of performance. High-traffic news servers based on BSD UNIX use experimental tweaks like FFS with soft updates and LFS to achieve Linux-like performance without sacrificing the original consistency guarantees (in the LFS case, even _extending_ the guarantees). audrey currently uses plain FFS. I plan to experiment with LFS on frannie.
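The core FFS rule--never let a name point at an object until the object itself is on disk--has a user-level analogue you can try on any UNIX. This is my sketch of the idea, not FFS internals, and the filenames are invented: write the new version under a temporary name, force it to disk, then rename() it over the old one atomically.

    /* Ordered-update sketch: pull the plug at any instant and the
     * name "folder" refers to either the complete old file or the
     * complete new one, never a half-written hybrid. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("folder.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd == -1) {
            perror("open");
            return 1;
        }
        if (write(fd, "new folder contents\n", 20) == -1) {
            perror("write");
            return 1;
        }
        if (fsync(fd) == -1) {  /* the object reaches disk FIRST... */
            perror("fsync");
            return 1;
        }
        close(fd);
        if (rename("folder.tmp", "folder") == -1) {  /* ...THEN the name moves */
            perror("rename");
            return 1;
        }
        return 0;
    }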

audrey's plug was pulled several times as the disk failure got progressively worse. The disk would have been replaced sooner if I'd known it was marginal, but because Linux never managed to log any of the disk errors where I could read them later and figure out what was going on, we assumed it was yet another network DoS vulnerability. As soon as one error occurred, Linux entered the bus RESET loop and wrote nothing more to disk. When it eventually crashed, it left the console blank, so the error messages preceding the crash couldn't even be read and recorded manually.

For comparison, I have only one NetBSD box that doesn't use a serial console, and it unblanks the screen on any console output. So, if output occurs before a crash, chances are the screen isn't blanked. NetBSD also tends to panic, drop you into a debugger, and write a core dump when the kernel gets confused--hard crashes and endless busywaits are rare. Linux doesn't even _have_ a debugger to drop into, and is not capable of dumping core to the swap partition the way most BSD UNIXes can.

Note that since I did not inspect the contents of files (realistically, I can't), I have no idea how much loss occurred _within_ files. Only metadata loss was visible during my recovery work, so you'll just have to find out how bad the _data_ loss is yourself.

However, it's worth realizing that metadata loss has greater and less predictable consequences than data loss. While bad disk blocks tend to cause _data_ loss more often than metadata loss (simply because there is, by a huge factor, more data on the disk than metadata), metadata gets written more often and is thus disproportionately likely to be lost when an operating system that caches data blindly (like Linux) crashes.
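To put rough numbers on that, here is a back-of-envelope C sketch; every figure in it is invented for illustration:

    /* Back-of-envelope arithmetic, all figures invented.  Bad blocks
     * hit data and metadata in proportion to the SPACE each occupies;
     * a crash with dirty buffers hits them in proportion to how often
     * each is WRITTEN.  Metadata is a sliver of the disk but a large
     * share of the write traffic, so crashes skew losses toward it. */
    #include <stdio.h>

    int main(void)
    {
        double meta_space  = 0.02;  /* metadata as a fraction of disk blocks */
        double meta_writes = 0.30;  /* metadata as a fraction of block writes */
        int    bad_blocks  = 10;    /* blocks the dying disk gives up on */
        int    dirty       = 200;   /* dirty buffers lost in one crash */

        printf("metadata blocks lost to bad media:   %.1f\n",
               bad_blocks * meta_space);
        printf("metadata buffers lost in the crash:  %.1f\n",
               dirty * meta_writes);
        return 0;
    }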

Draw whatever conclusions you like from all this.

Q. If this problem has been well understood since the '70s, why are we still so poorly equipped to deal with it? In the words of Mr. Ramsay, ``Someone had blundered,'' right? This all seems rather pathetic.

Indeed.

Q. Windows 98 seems to crash a lot, so it would probably benefit from a robust filesystem. I understand Microsoft's Latest Product includes the Advanced FAT32 Filesystem with longer filenames and support for bigger disks. Does Windows 98 guarantee some level of metadata consistency like BSD's FFS and LFS do?

You're kidding, right?


Miles Nordin <carton@Ivy.NET>
Last update (UTC timezone): $Id: ext2fs-vs-bsd.html,v 1.4 2004/09/08 07:38:51 carton Exp $