Re: best common practice in case of degraded array with read errors

On Mon Nov 16, 2009 at 10:32:30PM +0100, Mikael Abrahamsson wrote:

> 
> Hello.
> 
> I have a 6-drive RAID5. One of the drives failed on me (totally), and when 
> I replaced it (--add a new working drive) I had several sectors on another 
> drive give me UNC errors, which made md kick that drive as well, and left 
> me with a non-working array (with only 4 drives).
> 
Are you running any regular array checks?  These should verify the
readability of the drives (and the consistency of the parity).  This type
of failure is also why I've switched to RAID6 for most of my arrays.
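If not, a check is easy to kick off by hand - something like the below
(just a sketch; md0 here stands in for whatever your array is called, and
Debian's mdadm package, for instance, already ships a checkarray cron job
that does much the same thing monthly):

    # start a full read/consistency check of the array
    echo check > /sys/block/md0/md/sync_action

    # watch the check progress
    cat /proc/mdstat

    # once it's finished, see how many stripes didn't match
    cat /sys/block/md0/md/mismatch_cnt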

> What is the best common practice to handle this scenario? Right now I'm 
> running dd_rescue to copy the drive with read errors to a (hopefully) 
> working drive, and then I plan to --assemble --force the array to get 5 
> working drives (with a few zeroed sectors where I guess I'll have corrupted 
> files, hopefully no important metadata), and then I plan to --add a 6th 
> drive and have everything sync up and be back to "normal".
> 
> Is there a better way? I don't really understand why kicking drives out of 
> the array when there aren't enough of them left to keep going makes sense; 
> is there some rationale I'm missing?
> 
Technically, the best practice is probably to recreate the array from
scratch (replacing any failed drives) and restore from backup.  Short of
that, your approach would seem to be the best option.  I've done this in
the past, though I ended up restoring pretty much everything from backup
anyway (as I had no other way of verifying the integrity of the data).
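For the record, the rough sequence I'd expect you to end up running is
something like this - the device names are purely placeholders for your
setup, so double-check everything against /proc/mdstat and
mdadm --examine before running anything:

    # copy the failing member onto a known-good spare disk; the
    # unreadable sectors simply won't contain valid data on the copy
    dd_rescue /dev/sdd1 /dev/sdg1

    # force-assemble the array from the five usable members, with the
    # copy standing in for the failing disk
    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 \
        /dev/sdg1 /dev/sde1

    # then add a sixth drive back in and let it resync
    mdadm --add /dev/md0 /dev/sdf1
    cat /proc/mdstat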

> I've also heard recommendations to write to the bad sectors on the 
> existing drive, but that scares me as well in case I write to the wrong 
> place, which is why I went the dd_rescue route (I'm also hoping that it'll 
> retry a bit more and might be able to read the bad blocks...)
> 
I'd leave that until later.  Once you've imaged the disk you can try
SMART tests, read/write tests, etc. to verify whether there's actually a
physical problem or not (and how much of one - a bad block or two might
be acceptable, but a lot of them would point to a failing disk).  Until
then you're better off putting as little trust in the disk as possible.
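When you do get around to it, something along these lines is what I'd run
(again only a sketch, with /dev/sdd standing in for the suspect disk - and
note that badblocks -w is destructive, so only use it once nothing on the
drive matters any more):

    # long SMART self-test, then check the results and the raw
    # error/reallocated-sector counts
    smartctl -t long /dev/sdd
    # (wait for the test to complete before checking the log)
    smartctl -l selftest /dev/sdd
    smartctl -A /dev/sdd

    # destructive read/write surface scan (wipes the whole disk!)
    badblocks -wsv /dev/sdd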

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


