Re: How to recover after md crash during reshape?

andras@xxxxxxxxxxxxxxxx writes:

Phil has provided lots of useful advice; I'll just add a couple of
clarifications:

>
>      mdadm --grow --raid-devices=10 /dev/md1
>
> Yes, I was dumb enough to start the process without a backup option - 
> (copy-paste error from https://raid.wiki.kernel.org/index.php/Growing).

Nothing dumb about that - you don't need a --backup option.
If you did, mdadm would have complained.

You only need --backup when the size of the array is unchanged or
decreasing
(or when growing into a degraded array.  e.g. you can reshape a 4-drive
 raid5 to a degraded 5-drive raid5 without adding a spare.  This will
 require a --backup.  I'm fairly sure it also requires --force because
 it is a very strange thing to do).
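
To make that concrete, the degraded-grow case would look roughly like
this (the backup path is only a placeholder; the backup file must live
on a filesystem that is not on the array being reshaped):

  mdadm --grow --raid-devices=5 --force \
        --backup-file=/root/md1-grow.backup /dev/md1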

When reshaping to a larger array, mdadm only requires a backup while
reshaping the first few stripes, and it uses some space on one of the
new (previously spare) devices to store that backup.
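
(For what it's worth, you can see how far a reshape has progressed
either in /proc/mdstat or with something like

  mdadm --detail /dev/md1 | grep -i reshape

which should show a "Reshape Status" line while the reshape is
running - in your case it barely got started.)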


>
> This immediately (well, after 2 seconds) crashed the MD driver:
>
>      Oct 17 17:30:27 bazsalikom kernel: [7869821.514718] sd 0:0:0:0: 
> [sdj] Attached SCSI disk
>      Oct 17 18:39:21 bazsalikom kernel: [7873955.418679]  sdh: sdh1
>      Oct 17 18:39:37 bazsalikom kernel: [7873972.155084]  sdi: sdi1
>      Oct 17 18:39:49 bazsalikom kernel: [7873983.916038]  sdj: sdj1
>      Oct 17 18:40:33 bazsalikom kernel: [7874027.963430] md: bind<sdh1>
>      Oct 17 18:40:34 bazsalikom kernel: [7874028.263656] md: bind<sdi1>
>      Oct 17 18:40:34 bazsalikom kernel: [7874028.361112] md: bind<sdj1>
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667815] md: reshape of 
> RAID array md1
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667818] md: minimum 
> _guaranteed_  speed: 1000 KB/sec/disk.
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667821] md: using 
> maximum available idle IO bandwidth (but not more than 200000 KB/sec) 
> for reshape.
>      Oct 17 18:59:48 bazsalikom kernel: [7875182.667831] md: using 128k 
> window, over a total of 1465135936k.
> --> Oct 17 18:59:50 bazsalikom kernel: [7875184.326245] md: md_do_sync() 
> got signal ... exiting

This is very strange ... maybe some messages are missing?
Probably an IO error while writing to a new device.

>
>  From here on, things went downhill pretty damn fast. I was not able to 
> unmount the file-system, stop or re-start the array (/proc/mdstat went 
> away), any process trying to touch /dev/md1 hung, so eventually I ran 
> out of options and hit the reset button on the machine.
>
> Upon reboot, the array wouldn't assemble; it was complaining that sda 
> and sda1 had the same superblock info on them.
>
> mdadm: WARNING /dev/sda and /dev/sda1 appear to have very similar 
> superblocks.
>        If they are really different, please --zero the superblock on one
>        If they are the same or overlap, please remove one from the
>        DEVICE list in mdadm.conf.

It's very hard to make messages like this clear without being incredibly
verbose...

In this case /dev/sda and /dev/sda1 obviously overlap (that is obvious,
isn't it?), so you need to remove one of them from the DEVICE list.
You probably don't have a DEVICE list, so it defaults to everything
listed in /proc/partitions.
The "correct" thing to do at this point would have been to add a DEVICE
line to mdadm.conf which only lists the devices that might be part of
an array, e.g.

  DEVICE /dev/sd[a-z][1-9]
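
A minimal mdadm.conf along those lines might look like this - the ARRAY
line here is only illustrative; "mdadm --examine --scan" will generate
the real one from the component superblocks:

  DEVICE /dev/sd[a-z][1-9]
  ARRAY /dev/md1 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx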

> So, if I read this right, the superblock here states that the array is 
> in the middle of a reshape from 7 to 10 devices, but it just started 
> (4096 is the position).
> What's interesting is the device names listed here don't match the ones 
> reported by /proc/mdstat, and are actually incorrect. The right 
> partition numbers are in /proc/mdstat.
>
> The superblocks on the 6 other original disks match, except of course 
> for which one they mark as 'this' and the checksum.
>
> I've read in here (http://ubuntuforums.org/showthread.php?t=2133576) 
> among many other places that it might be possible to recover the data on 
> the array by trying to re-create it to the state before the re-shape.
>
> I've also read that if I want to re-create an array in read-only mode, I 
> should re-create it degraded.
>
> So, what I thought I would do is this:
>
>      mdadm --create /dev/md1 --level=6 --raid-devices=7 /dev/sdh2 
> /dev/sdf2 /dev/sdi1 /dev/sdg1 /dev/sde1 missing missing

Phil has given good advice on this point, which is worth following.
It is quite possible that there will still be corruption.

mdadm reads the first few stripes and stores them somewhere in each of
the spares.  md (in the kernel) then reads those stripes again and
writes them out in the new configuration.  It appears that one of the
writes failed; others might have succeeded.  This may not have corrupted
anything (the first few blocks are in the same position for both the old
and new layout) but it might have done.

So if the filesystem seems corrupt after the array is re-created, that
is likely the reason.
The data still exists in the backup on those new devices (if you haven't
done anything to them) and could be restored.
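
(A read-only check is enough to tell whether that has happened -
assuming an ext3/ext4 filesystem, something like

  fsck.ext4 -n /dev/md1

where -n makes sure nothing is written, is a safe first step before you
consider mounting the filesystem read-write.)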

If you do want to look for the backup, it is roughly in the middle of
the device and has some metadata which contains the string
"md_backup_data-1".  If you find that, you are close to getting the
backup data back.
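
Something along these lines should find it - /dev/sdj1 and the offsets
are only examples (pick one of the new devices and a window somewhere
near its middle), and note that the offsets "strings -t d" prints are
relative to the start of the dd window, not the start of the device:

  dd if=/dev/sdj1 bs=1M skip=600000 count=8192 2>/dev/null |
      strings -a -t d | grep md_backup_data-1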

NeilBrown


