Re: Degraded Raid6

On Wed Dec 12, 2012 at 09:49:07PM +0100, Bernd Waage wrote:

> Hello all,
> 
> I run an 8-disk raid 6 from which 2 drives sporadically dropped out,
> which I could just re-add once I had zeroed their superblocks.
> Recently, upon re-adding those 2 drives (after the zero-superblock) a
> third drive dropped out after 5-10 minutes of syncing. I then did a
> zero-superblock on the third drive and tried to re-add it - which
> failed.
> 
Firstly, drives sporadically dropping out of the array should _never_
just be ignored. You have a problem with your setup which needs fixing.
If the drives are actually okay (run SMART and full badblocks tests on
them) then it's probably a controller issue. I used to have a similar
issue on one of my servers and fixed it by moving the drives off the
onboard SATA controller and onto a proper SAS/SATA controller card.
Alternatively, it may be the cables, the power supply, or input power
fluctuations.
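
As a rough illustration (the drive name is just a placeholder), the
checks I mean would look something like:
    # Kick off a long SMART self-test, then review the results once it finishes
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX

    # Non-destructive, read-only badblocks scan (-s progress, -v verbose)
    badblocks -sv /dev/sdX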

> I'm pretty much at my wits' end and stumbled upon this list. Perhaps
> one of you can help me out. I'm running an Ubuntu 12.04 box with
> kernel 3.3.8, so I should not be affected by the kernel bug that
> popped up some time ago.
> 
> I append the output of mdadm --detail as well as mdadm --examine...
> 
They're all using the same data offset anyway, which is good. You do
need to check the mdadm version though as versions 3.2.4 and above use a
different data offset (as do versions prior to 3.0).
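
For reference, something along these lines should confirm the version
and the data offset recorded in each member's superblock (adjust the
device names to suit):
    mdadm --version

    # The "Data Offset" line should match across all of the members
    mdadm --examine /dev/sdb1 | grep -i "data offset"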

I'd also recommend checking the drives before proceeding - full SMART
tests and read-only badblocks tests on each drive should find any
issues (if there are any then you'll need to get replacements and clone
the old ones).
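
If one of them does turn out to be failing, a reasonable sketch (device
names are placeholders) is to clone it onto the replacement with GNU
ddrescue before touching the array again:
    # Copy the failing drive onto the replacement; the map file lets the
    # copy be resumed and bad sectors retried later
    ddrescue -f /dev/failing_drive /dev/replacement_drive /root/ddrescue.map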

You'll then need to recreate the array, using exactly the same
parameters as for the original array. From the looks of it, that should
be:
    mdadm -C /dev/md0 -l 6 -e 1.2 -n 8 -c 4096 /dev/sdf1 /dev/sdd1 \
        missing missing /dev/sdi1 /dev/sdc1 /dev/sdb1 missing

One of those "missing" values should be replaced with the drive that
originally was in that slot, but you've not provided that information.
The output from dmesg should show which drives failed when, and where
they were in the array. If your rebuild was using the drives in the same
order as they were before the first failure then any drive will be okay
to use as they should all have the correct information (though you'd be
better avoiding the one with the read error), otherwise you'll have to
use the last one that failed.
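
Just to illustrate the sort of thing to look for (the exact messages
will vary), and assuming an ext filesystem sits directly on /dev/md0:
    # Dig the failure history out of the kernel log
    dmesg | grep -iE "md:|raid"

    # After recreating, sanity-check the filesystem without writing anything
    fsck -n /dev/md0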

Of course, the easiest option would be to start from scratch, test all
the drives, create a new array, and restore the data from backup. I'm
guessing you don't have a backup though.

Good luck,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |
