Re: RAID5 with 2 drive failure at the same time

On Thu Jan 31, 2013 at 03:40:00PM -0700, Chris Murphy wrote:

> 
> On Jan 31, 2013, at 3:10 PM, Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
> 
> > If there is a read error
> > further back then I'd blame it on timeout issues, with the drive still
> > trying to complete the read operation while the kernel's timed out and
> > trying to send a write.
> 
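
(As an aside, the usual check for that sort of mismatch is to compare the
drive's error recovery timeout against the kernel's SCSI command timeout -
something along these lines, with /dev/sdX standing in for whichever drive
is suspect:

    # drive-side: SCT Error Recovery Control setting, if the drive supports it
    smartctl -l scterc /dev/sdX

    # kernel-side: command timeout for the same device, in seconds
    cat /sys/block/sdX/device/timeout

    # common workaround for desktop drives with no ERC support - raise the
    # kernel timeout well above the drive's internal retry time
    echo 180 > /sys/block/sdX/device/timeout

The point being that the drive has to give up before the kernel does, not
the other way round.)
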
> I think we need the whole log for the time before the start of the
> error1.txt file provided previously. And also I'd like to know which
> /dev/ device was the first to have a problem, that instigated the
> rebuild. And whether, during the rebuild, the file system was mounted rw,
> and whether any writes were done at all. If so, that probably nixes
> --assume-clean. If it was rebuilding and not written to from the file
> system, the disk being rebuilt shouldn't actually be out of sync with
> the array state.
> 
The timestamps on the logs show that sdg was the first to have a
problem. It'd also be useful to know whether sdg has been rewritten at
all since then (i.e. whether the testing was destructive or not), and
whether or not the array was written to at all since the failure of sdg.
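
(If all the members are still readable, comparing the superblocks should
confirm the failure order and whether anything was written afterwards -
device names here are just examples:

    mdadm --examine /dev/sdg1 /dev/sdj1 | egrep 'Events|Update Time|Array State'

Whichever member stopped updating first will show the oldest event count
and update time.)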

> The disk that needs spot sector repairs is the one with UREs, I think
> that's sdj1. If that disk is dd'd to another disk, the new disk won't
> produce UREs for sectors missing data, and the chunks comprised of
> those sectors won't get rebuilt by md.
> 
> So the disk to possibly dd to another is the one with the write error,
> sdg1. But only if the idea is to not use --assume-clean. That way a
> reassemble can rebuild, and not encounter another write error on that
> drive.
> 
Yes, if sdg still contains valid array data (and the array hasn't been
written to since then) then it would definitely make more sense to recreate
the array using it, leaving sdj out for now. That'll require more work
checking mdadm versions and data offset values, but it avoids the issues
with the unreadable blocks on sdj.
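
(Purely as an illustration of the sequence - the device names, disk count,
chunk size and data offset below are made up, and would all need to be taken
from the existing superblocks and the actual setup rather than copied
verbatim:

    # copy sdg to a fresh disk first, so the recreate never has to write to it
    ddrescue -f /dev/sdg1 /dev/sdX1 /root/sdg1.map

    # record what the old superblocks say before touching anything
    mdadm --version
    mdadm --examine /dev/sdg1 | egrep 'Version|Data Offset|Chunk Size|Layout|Device Role'

    # recreate in the original device order, with sdj's slot left "missing"
    # and --assume-clean so no resync is started
    mdadm --create /dev/md0 --metadata=1.2 --level=5 --raid-devices=4 \
          --chunk=512 --data-offset=2048s --assume-clean \
          /dev/sdX1 /dev/sdh1 /dev/sdi1 missing

A newer mdadm may pick a different default data offset than the one that
created the array, which is why it has to be given explicitly.)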

> > Not a chance I'd use it if it's actually failing to remap bad sectors,
> > no. Only had that with one drive so far though (out of several hundred),
> > most get failed out after getting more than a handful of remapped
> > sectors.
> 
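
(The remap count is easy enough to keep an eye on, e.g.:

    smartctl -A /dev/sdX | egrep -i 'Reallocated_Sector|Current_Pending'

with /dev/sdX again just a placeholder.)
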
> I think I see a use case for badblocks destructive writes if the disk
> doesn't support enhanced secure erase (which writes a pattern not just
> zeros). Or on laptops where it's not possible to get a disk to reset
> on sleep, allowing it to be unfrozen for the purposes of using secure
> erase. But if available, secure erase is faster and wipes all sectors
> even those without LBAs. For sure with SSDs it's what should be used.
> 
I prefer badblocks myself - I can see exactly what it's doing and what
errors are seen. With secure erase you're dependent on the firmware
internals to tell you what's actually going on (and, depending on the
nature of the errors you're getting, this may already be suspect).
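
(For the record, the sort of run I mean is the destructive write-mode test -
only on a drive with nothing left on it, and /dev/sdX is a placeholder:

    # four-pass write/read/verify over every block, showing progress and
    # reporting each bad block as it is found
    badblocks -wsv -b 4096 /dev/sdX

Whereas for secure erase you first have to check the drive isn't frozen:

    hdparm -I /dev/sdX | grep -i frozen

and then take the firmware's word for the rest.)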

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
