Re: RAID5 with 2 drive failure at the same time

On Thu Jan 31, 2013 at 10:46:17 -0700, Chris Murphy wrote:

> 
> On Jan 31, 2013, at 6:15 AM, Christoph Nelles <evilazrael@xxxxxxxxxxxxx> wrote:
> 
> > All drives are available again. And the second failed device reports
> > UREs. I will run badblocks on that device before continuing.
> > I attached the kernel logs of the first error and of the second error. I
> > hope I filtered them reasonably.
> 
> This looks like a write error, resulting in md immediately booting the
> drive. There's little point in using this drive again.
> 
> Jan 28 00:23:36 router kernel: Write(16): 8a 00 00 00 00 01 36 b2 55 50 00 00 00 30 00 00
> Jan 28 00:23:36 router kernel: end_request: I/O error, dev sdg, sector 5212624208
> 
It's definitely a write error, yes. If there's nothing further back in
the log (e.g. a read error that triggered a rewrite) then that counts
against the drive, though it could still be a transient error (or a
controller problem). If there is a read error further back then I'd
blame timeout issues: the drive is still trying to complete the read
while the kernel has timed out and is trying to send a write.
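
(If it does turn out to be a timeout mismatch, the usual check and
workaround is something along these lines - assuming the drive is sdg
as in your log, and that it supports SCT ERC, which many desktop drives
don't:

    # see whether SCT ERC is supported and what it's currently set to
    smartctl -l scterc /dev/sdg
    # if supported, cap error recovery at 7 seconds (values are in
    # tenths of a second)
    smartctl -l scterc,70,70 /dev/sdg
    # if not, raise the kernel's command timeout well above the drive's
    # internal recovery time instead (the default is 30 seconds)
    echo 180 > /sys/block/sdg/device/timeout
)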

> What does smartctl -a return for this drive?
> 
> 
> > Exactly. I am running badblocks on that device. SMART reports one
> > "Pending Sector Count" :(
> 
> I'm unclear on the efficacy of badblocks for testing. I'd use smartctl
> -t long and then -a to see if there are sector problems and at what
> LBA; and for removing bad blocks (force a remap) I'd use either dd
> zeros with e.g. bs=1M, or I'd use ATA Secure Erase which is faster.
> 
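
(For the archives, the two remap-forcing approaches you describe would
look roughly like this - again assuming /dev/sdg from the log, and note
that both wipe the whole drive; the Secure Erase password is just a
throwaway:

    # overwrite with zeros, forcing the drive to remap anything pending
    dd if=/dev/zero of=/dev/sdg bs=1M oflag=direct
    # or the faster route, ATA Secure Erase via hdparm - check with
    # "hdparm -I /dev/sdg" that security shows "not frozen" first
    hdparm --user-master u --security-set-pass pass /dev/sdg
    hdparm --user-master u --security-erase pass /dev/sdg
)
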
I don't usually bother with read tests - as you say, their usefulness
is questionable. If the data matters then use ddrescue to get what you
can off first; otherwise just write-test it. I usually do a full
destructive badblocks test (I've seen cases where zeros write fine but
other patterns fail), followed by a long SMART test.
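
(Concretely, the sequence I'd run here - assuming the disk is /dev/sdg,
the image paths are just placeholders, and nothing left on the disk
matters by the time you reach the badblocks step:

    # pull anything you still want onto another disk first (GNU ddrescue)
    ddrescue -d /dev/sdg /mnt/other/sdg.img /mnt/other/sdg.map
    # full destructive write/read-back test with all four patterns
    badblocks -wsv /dev/sdg
    # then a long surface scan, checking the results once it's finished
    smartctl -t long /dev/sdg
    smartctl -a /dev/sdg
)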

> If you use the badblocks map when formatting a drive, e.g. using
> mkfs.ext4 -c, then it would allow you to use this disk but not in
> RAID. On top of raid, md gets the write error before the file system
> does, and boots the drive out of the array. Or on read error attempts
> to correct it. And even as a standalone drive do you really want to
> use a drive that can't remap future bad sectors?
> 
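
(For completeness, that route would be something like the following -
assuming a partition sdg1, and badblocks has to be told the filesystem's
block size if you feed mkfs a separate list; see below for why I
wouldn't bother:

    # let mkfs drive the scan itself (-c is read-only, -cc is a slower
    # read-write test)
    mkfs.ext4 -cc /dev/sdg1
    # or scan separately and hand the list over
    badblocks -b 4096 -o sdg.bad /dev/sdg1
    mkfs.ext4 -b 4096 -l sdg.bad /dev/sdg1
)
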
Not a chance I'd use it if it's actually failing to remap bad sectors,
no. I've only had that with one drive so far though (out of several
hundred); most get failed out once they've accumulated more than a
handful of remapped sectors.
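
(The counters worth watching, assuming a SATA drive reporting the
standard attribute names:

    smartctl -A /dev/sdg | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
)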

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


