Re: Seeking help to get a failed RAID5 system back to life

Robin Hill <robin@xxxxxxxxxxxxxxx> · Fri, 29 Aug 2014 10:10:19 +0100

On Fri Aug 29, 2014 at 10:55:53AM +0200, Fabio Bacigalupo wrote:

> Hello Robin,
> 
> thank you for your feedback!
> 
> 2014-08-29 9:46 GMT+02:00 Robin Hill <robin@xxxxxxxxxxxxxxx>:
> > That's a disaster waiting to happen. You should never leave a RAID array
> > in a degraded state for any longer than is absolutely necessary,
> > otherwise you might as well not bother running RAID at all.
> 
> >> I could gather the following information:
> 
> > Are the above --examine results from before or after the replacement?
> 
> I took them before the replacement.
> 
I suspected as much.

> > Was the old /dev/sdc data replicated onto the replacement disk?
> 
> No, that is, not yet. Luckily the guys in the data center kept the disk.
> 
If the third disk had been in the array in the first place then you
could have just added the new disk and left the array to rebuild the
data onto it. With the array already degraded, though, there's no
redundancy left to rebuild from, so you absolutely need the data off
the old disk.
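
For what it's worth, a degraded array is easy to spot in /proc/mdstat: a
missing member shows up as "_" in the [UU_] status field. Here's a rough
sketch of a check you could run from cron. The function name is my own
invention, and it takes the file to read as an argument so it can be
tried against a saved copy of mdstat first.

```shell
#!/bin/sh
# Sketch: list md arrays that /proc/mdstat reports as degraded.
# A member shown as "_" in the [UU_] status field is missing or failed.
# degraded_arrays is a made-up helper name, not an mdadm tool.
degraded_arrays() {
    # Reads mdstat-format text from the file in $1 (default /proc/mdstat).
    awk '
        /^md[^ ]* :/ { name = $1 }                  # remember the array name
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }   # status field contains "_"
    ' "${1:-/proc/mdstat}"
}

# Example:
#   degraded_arrays                 # check the live system
#   degraded_arrays saved-mdstat    # check a saved copy
```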

> > If the initial --examine results were done on the same disks as the
> > --assemble then I'm rather confused as to why mdadm would find a
> > superblock for one and not for the other. Could you post the mdadm and
> > kernel versions - possibly there's a bug that's been fixed in newer
> > releases.
> 
> There will be no bug. I just was under a false assumption.
> 
> > If the --examine was on the old disk and this wasn't replicated onto the
> > new one then I'm not sure what you're expecting to happen here - you've
> > lost 2 disks in a 3-disk RAID-5 so your data is now toast.
> 
> Ok, now that is clear. I will use ddrescue to replicate the old disk
> to the new one and try again.
> 
You'll need to use --assemble --force in order to get the array going
again afterwards (as the event counts are different on the two disks).
If there are any blocks that ddrescue couldn't read then you'll also
need to run fsck on the array after assembly to deal with any resulting
corruption - the damage may be in file data or directory metadata, or
(if you're really lucky) just in unused parts of the disk.
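
Roughly, the whole sequence would look something like the sketch below.
It only prints the commands rather than running them, and every device
name, the array name and the map file path are placeholders I've made
up, so substitute your own (and use the partition devices for the mdadm
steps if your array members are partitions). The fsck line assumes the
filesystem sits directly on the array; -n is handed to the filesystem
checker, and for ext2/3/4 it means check without changing anything.

```shell
#!/bin/sh
# Sketch only: prints the recovery commands instead of running them.
# recovery_plan and all names passed to it are placeholders.
recovery_plan() {
    old=$1      # failing original disk that was pulled
    new=$2      # replacement disk (will be overwritten)
    good=$3     # the surviving array member
    md=$4       # array device
    map=$5      # ddrescue map file, kept on some other disk

    cat <<EOF
ddrescue -f -n $old $new $map
ddrescue -f -d -r3 $old $new $map
mdadm --examine $good $new | grep -E '^/dev/|Events'
mdadm --assemble --force $md $good $new
fsck -n $md
EOF
}

recovery_plan /dev/sdX /dev/sdY /dev/sdZ /dev/mdN /root/rescue.map
```

The first ddrescue pass copies everything that reads cleanly, and the
second goes back over the bad areas with direct access and up to three
retries. The --examine step is just there so you can see how far apart
the event counts are before forcing the assembly.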

I'd definitely recommend adding the third disk back into the array
afterwards though, and making sure regular checks are run on the array
(echo check > /sys/block/mdX/md/sync_action) to pick up any disk errors
or sync issues before they cause major problems.
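
If it helps, here's a small sketch of how that could be wrapped up for a
cron job. The function names and the SYSFS variable are my own (SYSFS is
only there so the script can be pointed at a scratch directory for
testing), and md0 and /dev/sdW1 in the comments are placeholders for
your array and the third disk.

```shell
#!/bin/sh
# Sketch: start a scrub on an md array and read back the mismatch count.
# SYSFS can be overridden for testing; it defaults to the real /sys.
SYSFS=${SYSFS:-/sys}

start_check() {
    # Writing "check" asks md to read every stripe and verify redundancy.
    echo check > "$SYSFS/block/$1/md/sync_action"
}

check_result() {
    # mismatch_cnt is updated by the check; nonzero deserves a closer look.
    cat "$SYSFS/block/$1/md/mismatch_cnt"
}

# Example (as root), once the array is running again:
#   mdadm /dev/md0 --add /dev/sdW1   # put the third disk back first
#   start_check md0
#   cat /proc/mdstat                 # shows progress while the check runs
#   check_result md0
```

Many distributions already ship a periodic job that does much the same
thing, so it's worth checking whether one is installed and enabled
before adding your own.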

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |