On Tue, 20 Oct 2009, Craig Ringer wrote:
You made an exact image of each drive onto new, spare drives with `dd' or a similar disk imaging tool before trying ANYTHING, right? Otherwise, you may well have made things worse, particularly since you've tried to resync the array. Even if the data was recoverable before, it might not be now.
This is actually pretty hard to screw up with Linux software RAID. It's not easy to corrupt a working volume by trying to add a bogus drive or mistyping simple commands. You'd have to botch the drive addition process altogether and screw with something else to take out a good drive.
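For anyone who does want the safety net Craig describes, a minimal sketch of the imaging step might look like the following. The function name `image_drive` and the device names in the usage comment are placeholders of mine, not anything from the original report, and the block size is an arbitrary choice.

```shell
#!/bin/sh
# image_drive SRC DST: copy SRC to DST block by block with dd.
# conv=noerror keeps dd going after a read error; sync pads each short
# or failed input block with zeros so later data stays at its offset.
image_drive() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync 2>/dev/null
}

# Hypothetical usage (placeholder device names; check them twice,
# since swapping source and destination destroys the source):
#   image_drive /dev/sdb /dev/sde
#   image_drive /dev/sdc /dev/sdf
```

Note that `conv=sync` pads a trailing partial block, so this is only an exact copy when the source size is a multiple of the block size, which holds for whole disks and partitions with 512-byte or 4096-byte sectors.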
If the problem is just a few bad sectors, you can usually just force-re-add the drives into the array and then copy the array contents to another drive either at a low level (with dd_rescue) or at a file system level.
This approach has saved me more than once. On the flip side, I have also more than once accidentally wiped out my only good copy of the data by making a mistake during an attempt at stressed-out heroics like this. You certainly don't want to wander down this more complicated path if there's a simple fix available within the context of the standard tools for array repairs.
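As a rough sketch of the force-re-add-then-copy approach, assuming an md array and member partitions with the placeholder names shown (none of these names come from the original report), the sequence could look like this. The `run` and `rescue_array` helpers are my own illustration; by default the script only prints the commands so they can be reviewed before anything touches the disks.

```shell
#!/bin/sh
# Sketch only: array, member, and mount point names are hypothetical.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rescue_array MD MOUNTPOINT MEMBER...
rescue_array() {
    md=$1
    mnt=$2
    shift 2
    # Assemble even if some members' metadata looks out of date.
    run mdadm --assemble --force "$md" "$@" || return 1
    # Mark the array read-only so md does not write to it during the copy.
    run mdadm --readonly "$md" || return 1
    # File-system-level rescue: mount read-only, then copy files off.
    run mount -o ro "$md" "$mnt"
}

rescue_array /dev/md0 /mnt/rescue /dev/sdb1 /dev/sdc1 /dev/sdd1
```

For the low-level route, the mount step would be replaced by something like `dd_rescue /dev/md0 /path/to/image` onto a separate drive. Set `DRY_RUN=0` only after checking that every device name is right.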
On a side note: I'm personally increasingly annoyed with the tendency of RAID controllers (and software RAID implementations) to treat disks with unrepairable bad sectors as dead and fail them out of the array.
Given how fast drives tend to go completely dead once the first error shows up, this is a reasonable policy in general.
Rather than failing a drive and as a result rendering the whole array unreadable in such situations, it should mark the drive defective, set the array to read-only, and start screaming for help.
The idea is great, but you have to ask exactly how the hardware and software involved are supposed to enforce making the array read-only. I don't think ATA and similar command sets implement that concept in a way hardware RAID could use at the level where it would need to happen. Linux software RAID could keep you from mounting the array read/write in this situation, but errors percolate up from the disk devices to the array devices through too many layers (especially if LVM is stuck in the middle) for that to be simple to implement either.
-- * Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD