Re: OT - 2 of 4 drives in a Raid10 array failed - Any chance of recovery?

Greg Smith <gsmith@xxxxxxxxxxxxx> · Wed, 21 Oct 2009 02:30:35 -0400 (EDT)

On Tue, 20 Oct 2009, Craig Ringer wrote:

> You made an exact image of each drive onto new, spare drives with `dd'
> or a similar disk imaging tool before trying ANYTHING, right? Otherwise,
> you may well have made things worse, particularly since you've tried to
> resync the array. Even if the data was recoverable before, it might not
> be now.

This is actually pretty hard to screw up with Linux software RAID.  It's 
not easy to corrupt a working volume by trying to add a bogus one or 
typing simple commands wrong.  You'd have to botch the drive addition 
process altogether and screw with something else to take out a good drive.

> If the problem is just a few bad sectors, you can usually just
> force-re-add the drives into the array and then copy the array contents
> to another drive either at a low level (with dd_rescue) or at a file
> system level.

This approach has saved me more than once.  On the flip side, I have also,
more than once, accidentally wiped out my only good copy of the data by
making a mistake during stressed-out heroics like this.  You certainly
don't want to wander down this more complicated path if there's a simple
fix available using the standard array repair tools.
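
To make the "copy at a low level" idea concrete, here is a minimal Python
sketch of the general technique a tool like dd_rescue relies on: read the
source block by block, and when a block cannot be read, record it and
write zeros in its place so later offsets stay aligned.  This is only an
illustration, not how dd_rescue itself is implemented; the function name
`rescue_copy`, its parameters, and the zero-fill policy are assumptions
made for the example.

```python
import io


def rescue_copy(src, dst, total_size, block_size=4096):
    """Copy total_size bytes from src to dst, zero-filling unreadable blocks.

    src and dst are seekable binary file objects (hypothetical interface
    for this sketch). Returns a list of (offset, length) ranges that
    raised OSError while being read.
    """
    bad_ranges = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            # Unreadable block: remember it and carry on with the next one.
            data = b""
            bad_ranges.append((offset, length))
        # Pad failed or short reads so offsets in the image stay aligned.
        data = data.ljust(length, b"\0")
        dst.seek(offset)
        dst.write(data)
        offset += length
    return bad_ranges


if __name__ == "__main__":
    # Demo on in-memory buffers; a real run would open block devices
    # or image files in binary mode instead.
    source = io.BytesIO(b"example data " * 1000)
    image = io.BytesIO()
    print(rescue_copy(source, image, len(source.getvalue())))
```

The point of writing to a separate image is the one made above: the
originals are only ever read, so a mistake in the copy step cannot take
out the last good copy of the data.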

> On a side note: I'm personally increasingly annoyed with the tendency of
> RAID controllers (and s/w raid implementations) to treat disks with
> unrepairable bad sectors as dead and fail them out of the array.

Given how fast drives tend to go completely dead once the first error 
shows up, this is a reasonable policy in general.

> Rather than failing a drive and as a result rendering the whole array
> unreadable in such situations, it should mark the drive defective, set
> the array to read-only, and start screaming for help.

The idea is great, but you have to ask exactly how the hardware and
software involved are supposed to enforce making the array read-only.  I
don't think ATA and similar command sets implement that concept at the
level where hardware RAID would need it.  Linux software RAID could keep
you from mounting the array read/write in this situation, but errors
percolate up from the disk devices to the array devices through too many
layers in Linux (especially if LVM is stuck in the middle) for that to be
simple to implement either.

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)