Re: Failed Raid 5 due to OS mbr written on one of the array drives

Phil Turmel <philip@xxxxxxxxxx> · Mon, 2 Nov 2015 13:19:23 -0500

Good afternoon Nicolas,

On 11/02/2015 12:02 PM, Nicolas Tellier wrote:
> I'm looking for advice regarding a failed (2 disks down) Raid 5 array
> with 4 disks. It was running in a NAS for quite some time but after I
> came back from a trip, I found out that the system disk was dead.
> After replacing the drive, I reinstalled the OS (OpenMediaVault for
> the curious). Sadly the mbr was written to one of the raid disks
> instead of the OS one. This would not have been too critical if, after
> booting the system, I hadn't realized that the array was already
> running in degraded mode prior to the OS disk problem. Luckily I have
> a backup for most of the critical data on that array. There is nothing
> that I cannot replace, but it would still be quite inconvenient. I
> guess it's a good opportunity to learn more about raid :)

Very good.  Many people don't understand that RAID != Backup.

> mdadm --stop /dev/md0 and removing the boot flag from the wrongly
> written raid disk are the only things that I have done so far.

Good.

[trim /]

> Now, /dev/sda is more interesting. The partition is still present and
> looks intact, it seems like it's just missing the superblock because
> of the mbr shenanigan. Also the two healthy drives still see it as
> active.

The superblock isn't anywhere near the MBR, so something more than just
grub-install must have happened.  Be prepared to have lost more of
/dev/sda1 than you've yet discovered.
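
To put numbers on that: with 1.2 metadata the md superblock sits 4 KiB
into the member device (here the partition sda1), while the MBR is the
first 512 bytes of the whole disk.  A quick calculation, assuming sda1
starts at sector 2048 (a common default, not confirmed for this disk):

```shell
#!/bin/sh
# Assumption: sda1 starts at sector 2048 (check with: fdisk -l /dev/sda).
part_start_sector=2048
sector_bytes=512
sb_offset_in_member=4096   # v1.2 superblock: 4 KiB into the member device

sb_offset_on_disk=$(( part_start_sector * sector_bytes + sb_offset_in_member ))
echo "MBR occupies bytes 0-511 of /dev/sda"
echo "v1.2 superblock would start at byte $sb_offset_on_disk of /dev/sda"
```

Running mdadm --examine /dev/sda1 is the direct way to see whether a
superblock is still present there.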

> After looking around on the internet, I found people suggesting to
> re-create the raid. It seems a bit extreme to me, but I cannot find
> any other solution…

Unfortunately, yes.  You are past the point of a forced assembly or
other normal operations.

> Luckily I saved the original command used to create this array. Here
> is the one I think would be relevant in this case :
> 
> mdadm --create --verbose --assume-clean /dev/md0 --level=5
> --metadata=1.2 --chunk=128 --raid-devices=4 /dev/sda1 /dev/sdb1
> /dev/sdc1 missing /dev/sdd

1) Leave off /dev/sdd
2) Include --data-offset=262144
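
With both changes applied, the command would look roughly like the
sketch below.  It only prints the command for review rather than running
it.  As far as I recall from mdadm(8), --data-offset is read as
kilobytes unless a suffix is given, while --examine reports sectors, so
check the value against the --examine output of sdb1 and sdc1 before
running anything.  Device names and order are taken from your original
command; the create_cmd variable is just for illustration.

```shell
#!/bin/sh
# Sketch only: prints the re-create command so it can be reviewed first.
# Device order must match the original array; "missing" holds sdd's slot.
create_cmd="mdadm --create --verbose --assume-clean /dev/md0 \
--level=5 --metadata=1.2 --chunk=128 --raid-devices=4 \
--data-offset=262144 \
/dev/sda1 /dev/sdb1 /dev/sdc1 missing"

echo "$create_cmd"
```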

If it runs and you can access the filesystem (I suspect not), set up a
partition on sdd and add that to your array.  Before partitioning, use
mdadm --zero-superblock to blow away the stale superblock on sdd.
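
If the filesystem does come up, the follow-up steps could be sketched as
below.  The run wrapper only prints each command unless DRY_RUN=0 is
set.  The read-only fsck, copying the partition layout from sdb with
sfdisk, and the resulting /dev/sdd1 name are assumptions for
illustration, not something taken from your setup.

```shell
#!/bin/sh
# Prints each step by default; set DRY_RUN=0 to really execute (as root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

readd_sdd() {
    # Check the filesystem without changing it (-n) before trusting the
    # re-created array; assumes an fsck that supports -n.
    run fsck -n /dev/md0
    # Remove the stale whole-disk superblock so sdd is not misdetected.
    run mdadm --zero-superblock /dev/sdd
    # Assumption: clone the partition layout from a healthy member (sdb).
    run sh -c 'sfdisk -d /dev/sdb | sfdisk /dev/sdd'
    # Add the new partition; md rebuilds onto it (watch /proc/mdstat).
    run mdadm /dev/md0 --add /dev/sdd1
}

readd_sdd
```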

If it doesn't work due to more sda1 damage and you end up starting over,
consider wiping all of those partition tables and creating the new array
on all bare devices.  That'll minimize the chance of a misidentified
device later.
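
Should you end up starting over, a from-scratch build on bare devices
might look like the sketch below.  It is destructive, so it only prints
the commands.  The device list assumes the four members are still sda
through sdd (names can change between boots), and it keeps your 128K
chunk size.

```shell
#!/bin/sh
# DESTRUCTIVE if executed by hand: this sketch only prints the commands.
members="/dev/sda /dev/sdb /dev/sdc /dev/sdd"   # assumption: verify first

plan_fresh_array() {
    for dev in $members; do
        # wipefs -a erases every signature it detects on the whole-disk
        # device, including the partition table.
        echo "wipefs -a $dev"
    done
    echo "mdadm --create --verbose /dev/md0 --level=5 --metadata=1.2 --chunk=128 --raid-devices=4 $members"
}

plan_fresh_array
```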

You'll want to investigate the health of sdd.  If it's healthy, then its
drop-out must have been for some other reason.  You'll want to ensure
that doesn't happen again.
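
A first look at sdd's health could go along these lines, assuming
smartmontools is installed.  The helper name is made up for
illustration, and it prints the commands rather than running them.

```shell
#!/bin/sh
# Prints the smartctl commands for one drive; run them as root by hand.
smart_check_plan() {
    dev=$1
    echo "smartctl -H -A $dev"         # overall health verdict + attributes
    echo "smartctl -l error $dev"      # the drive's logged errors
    echo "smartctl -t long $dev"       # start an extended self-test
    echo "smartctl -l selftest $dev"   # read results once the test finishes
}

smart_check_plan /dev/sdd
```

Reallocated or pending sector counts in the attribute table, together
with the kernel log from around the time the drive dropped out, usually
show whether the disk itself or something else was at fault.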

Consider browsing the archives and/or subscribing to learn the many
pitfalls for the unwary (hint: try searching for "timeout mismatch").
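
The usual gist of those threads: a drive's internal error recovery can
take longer than the kernel's command timeout (30 seconds by default),
so the drive gets kicked out of the array instead of reporting a read
error.  A commonly posted workaround, sketched here as a dry run, is to
cap the drive's recovery time via SCT ERC where supported and otherwise
raise the kernel timeout.  The 7-second and 180-second values are the
ones typically quoted, not something specific to your drives.

```shell
#!/bin/sh
# Prints the commands for each member; run as root after review.
timeout_fix_plan() {
    for dev in "$@"; do
        name=${dev#/dev/}
        # Query SCT Error Recovery Control support and current setting.
        echo "smartctl -l scterc $dev"
        # If supported: limit read/write recovery to 7.0 s (units of 0.1 s).
        echo "smartctl -l scterc,70,70 $dev"
        # If not supported: raise the kernel's per-command timeout instead.
        echo "echo 180 > /sys/block/$name/device/timeout"
    done
}

timeout_fix_plan /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Neither setting generally survives a power cycle, so they are normally
applied from a boot script.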

Phil