Re: Failed Raid 5 due to OS mbr written on one of the array drives

Hi Phil, and thanks for taking the time to reply to me.

I'm not sure I understood your advice correctly.
When you say leave off /dev/sdd, do you mean I should recreate a
3-disk array like this:
mdadm --create --verbose --assume-clean /dev/md0 --level=5
--metadata=1.2 --chunk=128 --data-offset=262144 --raid-devices=3
/dev/sda1 /dev/sdb1 /dev/sdc1

Or a 4-disk array with no mention of /dev/sdd, like this (my
instinct tells me that wouldn't work):
mdadm --create --verbose --assume-clean /dev/md0 --level=5
--metadata=1.2 --chunk=128 --data-offset=262144 --raid-devices=4
/dev/sda1 /dev/sdb1 /dev/sdc1

Or, as I was thinking originally, to put /dev/sdd as missing, like so:
mdadm --create --verbose --assume-clean /dev/md0 --level=5
--metadata=1.2 --chunk=128 --data-offset=262144 --raid-devices=4
/dev/sda1 /dev/sdb1 /dev/sdc1 missing /dev/sdd
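
Whichever of those turns out to be the right command, my plan is to
sanity-check the result read-only before touching anything else.
Roughly this (the ext4 fsck is only a placeholder for whatever check
matches the actual filesystem, and /mnt is just an arbitrary mount point):

# confirm which device ended up in which slot
mdadm --detail /dev/md0

# read-only filesystem check first; -n answers "no" to every repair prompt
fsck.ext4 -n /dev/md0

# if that looks sane, mount read-only and spot-check some files
mount -o ro /dev/md0 /mnt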


If it turns out I have to start over from scratch, I'll be sure to
clean everything (zero-superblock and all) and check the health of
all the devices. Raid 6 might be better suited to withstand some of my
episodic stupidity :)
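
For reference, the cleanup I have in mind for that worst case would
look roughly like this (wipefs is my own addition for clearing the old
signatures, following your suggestion to re-create on bare devices):

# remove the stale md superblock from each former member
# (repeating for sdb1, sdc1, and sdd's stale superblock)
mdadm --zero-superblock /dev/sda1

# then clear any leftover signatures and partition tables, one disk
# at a time, before re-creating on the bare devices
wipefs -a /dev/sda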

I'm tempted to subscribe to the mailing list, but I'm afraid of the
volume of e-mail I'm going to get. I originally tried to search
through the archives, but couldn't find anything relevant to my case.
I'm looking right now at the "timeout mismatch" results.
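
If I've understood the "timeout mismatch" threads correctly, the check
goes something like this (the 7-second / 180-second values are the ones
I keep seeing quoted, not something I've confirmed for my own drives):

# see whether the drive supports SCT Error Recovery Control
smartctl -l scterc /dev/sda

# if it does, cap error recovery at 7 seconds (value is in tenths of a second)
smartctl -l scterc,70,70 /dev/sda

# if it doesn't, raise the kernel's command timeout instead
echo 180 > /sys/block/sda/device/timeout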

Thanks again.

Nico

2015-11-02 19:19 GMT+01:00 Phil Turmel <philip@xxxxxxxxxx>:
> Good afternoon Nicolas,
>
> On 11/02/2015 12:02 PM, Nicolas Tellier wrote:
>> I'm looking for advice regarding a failed (2 disks down) Raid 5 array
>> with 4 disks. It was running in a NAS for quite some time but after I
>> came back from a trip, I found out that the system disk was dead.
>> After replacing the drive, I reinstalled the OS (OpenMediaVault for
>> the curious). Sadly the mbr was written to one of the raid disks
>> instead of the OS one. This would not have been too critical, except
>> that after booting the system I realized the array had already been
>> running in degraded mode prior to the OS disk problem. Luckily I have
>> a backup for most of the critical data on that array. There is nothing
>> that I cannot replace, but it would still be quite inconvenient. I
>> guess it's a good opportunity to learn more about raid :)
>
> Very good.  Many people don't understand that RAID != Backup.
>
>>  mdadm --stop /dev/md0 and removing the boot flag from the wrongly
>> written raid disk are the only things I have done so far.
>
> Good.
>
> [trim /]
>
>> Now, /dev/sda is more interesting. The partition is still present and
>> looks intact; it seems like it's just missing the superblock because
>> of the mbr shenanigans. Also, the two healthy drives still see it as
>> active.
>
> The superblock isn't anywhere near the MBR, so something more than just
> grub-install must have happened.  Be prepared to have lost more of
> /dev/sda1 than you've yet discovered.
>
>> After looking around on the internet, I found people suggesting
>> re-creating the raid. It seems a bit extreme to me, but I cannot find
>> any other solution…
>
> Unfortunately, yes.  You are past the point of a forced assembly or
> other normal operations.
>
>> Luckily I saved the original command used to create this array. Here
>> is the one I think would be relevant in this case :
>>
>> mdadm --create --verbose --assume-clean /dev/md0 --level=5
>> --metadata=1.2 --chunk=128 --raid-devices=4 /dev/sda1 /dev/sdb1
>> /dev/sdc1 missing /dev/sdd
>
> 1) Leave off /dev/sdd
> 2) Include --data-offset=262144
>
> If it runs and you can access the filesystem (I suspect not), set up a
> partition on sdd and add that to your array.  Use zero-superblock to
> blow away the stale superblock on sdd.
>
> If it doesn't work due to more sda1 damage and you end up starting over,
> consider wiping all of those partition tables and creating the new array
> on all bare devices.  That'll minimize the chance of a misidentified
> device later.
>
> You'll want to investigate the health of sdd.  If it's healthy, then its
> drop-out must have been for some other reason.  You'll want to ensure
> that doesn't happen again.
>
> Consider browsing the archives and/or subscribing to learn the many
> pitfalls for the unwary (hint: try searching for "timeout mismatch").
>
> Phil