On Fri May 03, 2013 at 12:38:47PM +0100, Benjamin ESTRABAUD wrote:

> On 03/05/13 12:23, Andreas Boman wrote:
> > I have (had?) a RAID 5 array with 5 disks (1.5TB each); smartd warned
> > that one was getting bad, so I replaced it with an identical disk.
> > I issued mdadm --manage --add /dev/md127 /dev/sdX
> >
> > The array seemed to be rebuilding, and was at around 15% when I went
> > to bed.
> >
> > This morning I came up to see the array degraded with two missing
> > drives; another failed during the rebuild.
> >
> > I powered the system down, and since I still have the disk smartd
> > flagged as bad, I tried to just plug that in and power up, hoping to
> > see the array come back up - no such luck (not enough disks).
> >
> Unfortunately this happens way too often: your RAID members silently
> fail over time. They will get some bad blocks, and you won't know about
> it until you try to read or write one of the bad blocks. When that
> happens, a disk will get kicked out. At this stage you'll replace the
> disk, not knowing that other areas of the other RAID members have also
> failed. The only sensible option is to run a RAID 6, which dramatically
> reduces the potential for double failure, or to run a RAID 5 but run a
> weekly (at least) check of the entire array for bad blocks, carefully
> monitoring the SMART-reported changes after running the test (trying to
> read the entire array will cause any bad blocks to be detected and
> reallocated).
>
I'd recommend running a regular check anyway, whatever RAID level
you're running. I wouldn't choose to run RAID5 with more than 5 disks,
or with members larger than 1TB either, but that really comes down to
an individual's cost/benefit view of the situation.

> > I powered the system down again, and now I'm trying to evaluate my
> > best options to recover. Hoping to have some good advice in my inbox
> > when I get back from the office. I'll be able to boot the thing up
> > and get log info this afternoon.
> >
> I had to recover an array like that twice. The most important thing is
> probably to mitigate the data loss on the second drive that is failing
> right now by "ddrescueing" all of its data onto another drive before it
> gets more damaged (the longer the failing drive is online, the less
> chance you have). Use GNU ddrescue for that purpose.
>
> Once you have rescued the failing drive onto a new one, you could then
> try to add that new recovered drive in place of the failing one and
> start the resync as you did before.
>
Not "add", but "force assemble using". Whenever you end up in a
situation with too many failed disks to start your array, ddrescue and
force assemble (mdadm -Af) is your safest option. If that fails you can
look at more extreme measures, but that should always be the first
approach.

Cheers,
    Robin

-- 
     ___
   ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
  / / )      | Little Jim says ....                            |
 // !!       |      "He fallen in de water !!"                 |
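
For anyone reading this thread later, a minimal sketch of the ddrescue
plus force-assemble sequence Robin describes. The device names here are
illustrative assumptions (/dev/sdd for the failing member, /dev/sde for
the replacement, /dev/sda-/dev/sdc for the surviving members), as is the
mapfile path; only /dev/md127 comes from the original report:

    # Copy the failing member onto the replacement with GNU ddrescue.
    # First pass skips the slow scraping of bad areas; second pass
    # retries them. The mapfile lets the copy be resumed if interrupted.
    ddrescue -f -n /dev/sdd /dev/sde /root/sdd-rescue.map
    ddrescue -f -r3 /dev/sdd /dev/sde /root/sdd-rescue.map

    # Make sure the array is stopped, then force-assemble it from the
    # surviving members plus the rescued copy (mdadm -Af).
    mdadm --stop /dev/md127
    mdadm --assemble --force /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sde

The regular check Robin recommends can be triggered by hand or from
cron via the md sync_action interface, e.g.:

    echo check > /sys/block/md127/md/sync_action

with the result inspected afterwards in
/sys/block/md127/md/mismatch_cnt and the drives' SMART counters.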