>>>>> "cs" == Charles Sprickman <spork@xxx> writes:

    cs> smartmontools seconded.

I actually don't pay any attention to whether the ``overall
assessment'' says PASSED or not.  It seems to always say PASSED.

The goal is to distinguish between two different problems:

 1. bad driver.  bad card.  bad cable.

 2. bad disk.

You can look at 'smartctl -a'.  If the UDMA_CRC_Error_Count raw count
is increasing, it's a bad cable.  If the Hardware_ECC_Recovered or
Seek_Error_Rate counts are increasing, it's a bad drive.

Another, maybe more decisive, method: you can start 'smartctl -t long'
to tell the drive to test itself.  The output will tell you the
``recommended polling interval,'' which is about how long the test
will take.  This will be about 1 - 4 hours.  smartctl returns
immediately, and the drive tests itself in the background.  Do this
only on a drive that's not mounted.

Then run 'smartctl -a'.  Then run 'smartctl -a' a second time.  Make
sure the test is still running.  Sometimes, sending the drive a
command will abort the test, and 'smartctl -a' is a command---that's
why you run it twice.

If your tests are getting aborted you'll see something like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Aborted by host               70%         0         -

In that case, (1) your drive's firmware is too old to work well, (2)
your power supply is bad, or (3) you are trying to test a mounted
drive.  You could try starting the test with smartctl, then unplugging
the IDE cable, and leaving the drive connected to power only for about
four hours.

If you can get the test to keep running, then after four hours or so,
do 'smartctl -a' again, and the result of the test will show up at the
bottom like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2373         -

This is how the drive reports the result: by writing it to this
nonvolatile log which you can check later.  There is no other
reporting method.
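The two checks described so far (watching which raw counts grow between
two 'smartctl -a' runs, and reading the newest self-test log entry) can
be sketched in Python.  This is only a rough illustration: the function
names are made up for this example, the cable-versus-drive mapping is
just the rule of thumb given above, and the parsing assumes the usual
ten-column attribute table and a log row laid out like the samples
above.  Output from other drives or smartctl versions may need
adjusting.

```python
import re

# Rule of thumb from the text above, not something smartctl itself reports.
CABLE_ATTRS = {"UDMA_CRC_Error_Count"}
DRIVE_ATTRS = {"Hardware_ECC_Recovered", "Seek_Error_Rate"}

# Assumed log row layout: "# N  <description and status>  NN%  <hours>  <LBA or ->"
LOG_ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def raw_values(smartctl_output):
    """Return {attribute_name: raw_value} parsed from 'smartctl -a' text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
        # WHEN_FAILED RAW_VALUE.  Rows with a non-numeric raw value are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def suspects(before_output, after_output):
    """Compare two snapshots; return a set containing 'cable' and/or 'drive'."""
    before = raw_values(before_output)
    after = raw_values(after_output)
    # An attribute missing from the first snapshot is treated as unchanged.
    grown = {name for name, raw in after.items() if raw > before.get(name, raw)}
    result = set()
    if grown & CABLE_ATTRS:
        result.add("cable")
    if grown & DRIVE_ATTRS:
        result.add("drive")
    return result


def latest_self_test(smartctl_output):
    """Return (text, remaining_percent, lba_or_None) for log entry # 1, or None."""
    for line in smartctl_output.splitlines():
        match = LOG_ROW.match(line)
        if match and match.group(1) == "1":
            lba = None if match.group(5) == "-" else int(match.group(5))
            # 'text' holds the description and status columns together.
            return match.group(2), int(match.group(3)), lba
    return None
```

Feed suspects() the saved text of two 'smartctl -a' runs taken some
time apart.  For latest_self_test(), look for "Aborted" or "Completed
without error" in the returned text.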
A ``Completed without error'' result like the one above mostly proves
the drive is good.  If 'smartctl -t long' gives a good result, try
another test: 'dd if=/dev/ad0 of=/dev/null bs=512'.  If 'dd' reports
an I/O error before the address of the end of the drive but
'smartctl -t long' reports good, that means your problem is with
driver/card/cable.

(It is normal on some but not all Unixes for 'dd' to get you an IDE
driver error in 'dmesg' by trying to read past the end of the disk.
You need to look at the sector number of the error, and see if it's
in the middle of the drive or if it's past the end.)

A bad-drive result from 'smartctl -t long' should have a non-empty
LBA_of_first_error like this:

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     19687         204786353

I think good drives connected to bad power supplies can possibly fail
this 'smartctl -t long' test too, but I just RMA them unconditionally
when they fail and include a copy of the smartctl output.

I have had bad power supply problems twice.  The first time I spotted
it using a scope (~600mV ripple during disk read/write activity rather
than 100-200mV), and the second time by trial-and-error part-swapping.
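The way the self-test result and the 'dd' result combine can be written
down as a small decision function.  This is a sketch of the reasoning
above, not a tool: the function name and its parameters are
hypothetical, and you have to supply the numbers yourself (the failing
sector from the 'dd' or 'dmesg' error, and the drive size in sectors).

```python
def diagnose(self_test_passed, lba_of_first_error, dd_error_sector, total_sectors):
    """Combine a 'smartctl -t long' result with a full-disk 'dd' read.

    self_test_passed   -- True if the log entry says "Completed without error"
    lba_of_first_error -- integer LBA from the log, or None if it shows "-"
    dd_error_sector    -- sector of the I/O error seen during dd, or None if
                          dd read the whole disk without complaint
    total_sectors      -- size of the drive in sectors (valid sectors are
                          0 .. total_sectors - 1)
    """
    if not self_test_passed:
        if lba_of_first_error is not None:
            # The drive itself found an unreadable sector.
            return "bad drive"
        # For example an aborted test: the log proves nothing either way.
        return "inconclusive"

    if dd_error_sector is not None and dd_error_sector < total_sectors:
        # The drive says it is fine, yet the host cannot read a sector that
        # lies inside the drive: suspect what sits between host and drive.
        return "driver/card/cable"

    # No dd error, or only an error at or past the end of the disk, which
    # some systems report as a matter of course.
    return "drive looks good"
```

For example, a passed self-test plus a 'dd' error in the middle of the
disk gives "driver/card/cable", while a log entry with an
LBA_of_first_error gives "bad drive" no matter what 'dd' says.  Note
that "bad drive" here follows the RMA-on-failure habit described above;
it does not rule out a bad power supply.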