Re: Likely forced assembly with wrong disk during raid5 grow. Recoverable?

On Sun, 20 Feb 2011 04:23:09 +0100 Claude Nobs <claudenobs@xxxxxxxxx> wrote:

> Hi All,
> 
> I was wondering if someone might be willing to share if this array is
> recoverable.
> 

Probably is.  But don't do anything yet - any further action before you have
read all of the following email will probably cause more harm than good.

> I had a clean, running raid5 using 4 block devices (two of those were
> 2 disk raid0 md devices) in RAID 5. Last night I decided it was safe
> to grow the array by one disk. But then a) a disk failed, b) a power
> loss occured, c) i probably switched the wrong disk and forced
> assembly, resulting in an inconsistent state. Here is a complete set
> of actions taken :

Providing this level of information is excellent!


> 
> > bernstein@server:~$ sudo mdadm --grow --raid-devices=5 --backup-file=/raid.grow.backupfile /dev/md2
> > mdadm: Need to backup 768K of critical section..
> > mdadm: ... critical section passed.
> > bernstein@server:~$ cat /proc/mdstat
> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> > md1 : active raid0 sdg1[1] sdf1[0]
> >       976770944 blocks super 1.2 64k chunks
> >
> > md2 : active raid5 sda1[5] md0[4] md1[3] sdd1[1] sdc1[0]
> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> >       [>....................]  reshape =  1.6% (16423164/976760640) finish=902.2min speed=17739K/sec
> >
> > md0 : active raid0 sdh1[0] sdb1[1]
> >       976770944 blocks super 1.2 64k chunks
> >
> > unused devices: <none>

All looks good so far.

> 
> 
> now i thought /dev/sdg1 failed. unfortunately i have no log for this
> one, just my memory of seeing this changed to the one above :
> 
> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/5] [UU_UU]
> 

Unfortunately it is not possible to know which drive is missing from the
above info.  The [numbers] in brackets don't exactly correspond to the
positions in the array that you might think they do.  The mdstat listing above
has numbers 0,1,3,4,5.

They are the 'Number' column in the --detail output below.  The missing device
is /dev/md1 - I can tell from the --examine outputs, but it is a bit confusing.
Newer versions of mdadm make this a little less confusing.  If you look at the
patterns of 'U' and 'u' in the 'Array State' lines, the capital 'U' is 'this
device' and each lower-case 'u' is some other device that is thought to be
working.
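
Putting the --detail output below together with the --examine outputs further
down, the five RaidDevice positions appear to map as:

   RaidDevice:  0      1      2     3     4
   member:      sdc1   sdd1   md1   md0   sda1

(md0 shows up as /dev/block/9:0 in the --detail listing.)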

So /dev/md1 had a failure, so it could well have been sdg1.


> some 10 minutes later a power loss occurred, thanks to an ups the
> server shut down as with 'shutdown -h now'. now i exchanged /dev/sdg1,
> rebooted and in a lapse of judgement forced assembly:

Perfect timing :-)

> 
> > bernstein@server:~$ sudo mdadm --assemble --run /dev/md2 /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
> > mdadm: Could not open /dev/sda1 for write - cannot Assemble array.
> > mdadm: Failed to restore critical section for reshape, sorry.

This isn't actually a 'forced assembly' as you seem to think.  There is no
'-f' or '--force'.  It didn't cause any harm.

> >
> > bernstein@server:~$ sudo mdadm --detail /dev/md2
> > /dev/md2:
> >         Version : 01.02
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >   Used Dev Size : 976760640 (931.51 GiB 1000.20 GB)
> >    Raid Devices : 5
> >   Total Devices : 3
> > Preferred Minor : 3
> >     Persistence : Superblock is persistent
> >
> >     Update Time : Sat Feb 19 22:32:04 2011
> >           State : active, degraded, Not Started
                                        ^^^^^^^^^^^^

mdadm has put the devices together as best it can, but has not started the
array because it didn't have enough devices.  This is good.


> >  Active Devices : 3
> > Working Devices : 3
> >  Failed Devices : 0
> >   Spare Devices : 0
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >   Delta Devices : 1, (4->5)
> >
> >            Name : master:public
> >            UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >          Events : 133609
> >
> >     Number   Major   Minor   RaidDevice State
> >        0       8       33        0      active sync   /dev/sdc1
> >        1       0        0        1      removed
> >        2       0        0        2      removed
> >        4       9        0        3      active sync   /dev/block/9:0
> >        5       8        1        4      active sync   /dev/sda1

So you now have 2 devices missing.  As long as we can find the devices,
  mdadm --assemble --force
should be able to put them together for you.  But let's see what we have...
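
For the record - and not something to run yet - a forced assembly of the
members would look roughly like this (device names taken from your listings;
sdd1 is deliberately left out, for reasons that become clear from the
analysis below):

  # DO NOT run this yet: a plain --force assembly will not handle the
  # interrupted reshape correctly, as explained further down.
  # mdadm may also want the --backup-file that was given to --grow.
  mdadm --assemble --force /dev/md2 /dev/sdc1 /dev/md1 /dev/md0 /dev/sda1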

> 
> so i reattached the old disk, got /dev/md1 back and did the
> investigation i should have done before :
> 
> > bernstein@server:~$ sudo mdadm --examine /dev/sdd1
> > /dev/sdd1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x4
> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >            Name : master:public
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >    Raid Devices : 5
> >
> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 5e37fc7c:50ff3b50:de3755a1:6bdbebc6
> >
> >   Reshape pos'n : 489510400 (466.83 GiB 501.26 GB)
> >   Delta Devices : 1 (4->5)
> >
> >     Update Time : Sat Feb 19 22:23:09 2011
> >        Checksum : fd0c1794 - correct
> >          Events : 133567
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >     Array Slot : 1 (0, 1, failed, 2, 3, 4)
> >    Array State : uUuuu 1 failed

This device thinks all is well.  The "1 failed" is misleading.  The
   uUuuu
pattern says that all the devices are thought to be working.
Note for later reference:
         Events: 133567
 Reshape pos'n : 489510400


> > bernstein@server:~$ sudo mdadm --examine /dev/sda1
> > /dev/sda1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x4
> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >            Name : master:public
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >    Raid Devices : 5
> >
> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : baebd175:e4128e4c:f768b60f:4df18f77
> >
> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
> >   Delta Devices : 1 (4->5)
> >
> >     Update Time : Sat Feb 19 22:32:04 2011
> >        Checksum : 12c832c6 - correct
> >          Events : 133609
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >     Array Slot : 5 (0, failed, failed, failed, 3, 4)
> >    Array State : u__uU 3 failed

This device thinks devices 1 and 2 have failed (the '_'s).
So 'sdd1' above, and md1.
        Events : 133609 - this has advanced a bit from sdd1
 Reshape Pos'n : 502815488 - this has advanced quite a lot.


> > bernstein@server:~$ sudo mdadm --examine /dev/sdc1
> > /dev/sdc1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x4
> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >            Name : master:public
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >    Raid Devices : 5
> >
> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 82f5284a:2bffb837:19d366ab:ef2e3d94
> >
> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
> >   Delta Devices : 1 (4->5)
> >
> >     Update Time : Sat Feb 19 22:32:04 2011
> >        Checksum : 8aa7d094 - correct
> >          Events : 133609
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >     Array Slot : 0 (0, failed, failed, failed, 3, 4)
> >    Array State : U__uu 3 failed

 Reshape pos'n, Events, and Array State are identical to sda1.
So these two are in agreement.


> > bernstein@server:~$ sudo mdadm --examine /dev/md0
> > /dev/md0:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x4
> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >            Name : master:public
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >    Raid Devices : 5
> >
> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 83ecd60d:f3947a5e:a69c4353:3c4a0893
> >
> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
> >   Delta Devices : 1 (4->5)
> >
> >     Update Time : Sat Feb 19 22:32:04 2011
> >        Checksum : 1bbf913b - correct
> >          Events : 133609
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >     Array Slot : 4 (0, failed, failed, failed, 3, 4)
> >    Array State : u__Uu 3 failed

Again, exactly the same as sda1 and sdc1.

> > bernstein@server:~$ sudo mdadm --examine /dev/md1
> > /dev/md1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x4
> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
> >            Name : master:public
> >   Creation Time : Sat Jan 22 00:15:43 2011
> >      Raid Level : raid5
> >    Raid Devices : 5
> >
> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
> >     Data Offset : 272 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 3c7e2c3f:8b6c7c43:a0ce7e33:ad680bed
> >
> >   Reshape pos'n : 502809856 (479.52 GiB 514.88 GB)
> >   Delta Devices : 1 (4->5)
> >
> >     Update Time : Sat Feb 19 22:30:29 2011
> >        Checksum : 6c591e90 - correct
> >          Events : 133603
> >
> >          Layout : left-symmetric
> >      Chunk Size : 64K
> >
> >     Array Slot : 3 (0, failed, failed, 2, 3, 4)
> >    Array State : u_Uuu 2 failed

And here is md1.  It thinks device 1 - sdd1 - has failed (the '_').
        Events : 133603 - slightly behind the 3 good devices, but well after
                          sdd1.
 Reshape Pos'n : 502809856 - just a little before the 3 good devices too.
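
Putting the numbers from the five --examine outputs side by side:

   member   Events   Reshape pos'n   Update Time
   sdd1     133567   489510400       Sat Feb 19 22:23:09 2011
   md1      133603   502809856       Sat Feb 19 22:30:29 2011
   sdc1     133609   502815488       Sat Feb 19 22:32:04 2011
   md0      133609   502815488       Sat Feb 19 22:32:04 2011
   sda1     133609   502815488       Sat Feb 19 22:32:04 2011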

> 
> so obviously not /dev/sdd1 failed. however (due to that silly forced
> assembly?!) the reshape pos'n field of md0, sd[ac]1 differs from md1 a
> few bytes, resulting in an inconsistent state...

The way I read it is:

  sdd1 failed first - shortly after Sat Feb 19 22:23:09 2011 - the update
                      time on sdd1
The reshape continued until some time between Sat Feb 19 22:30:29 2011
and Sat Feb 19 22:32:04 2011, when md1 had a failure.
The reshape could not continue at that point, so it stopped.

So the data on sdd1 is well out of date (there has been about 8 minutes of
reshape since then) and cannot be used.
The data on md1 is very close to the rest.  The data that was in the process
of being relocated lives in two locations on the 'good' drives, both the new
and the old.  It only lives in the 'old' location on md1.

So what we need to do is re-assemble the array, but telling it that the
reshape has only gone as far as md1 thinks it has.  This will make sure it
repeats that last part of the reshape.
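
In concrete terms that means winding the recorded reshape position back from
502815488K to md1's 502809856K, so only a small slice of the reshape gets
repeated (a new 5-device stripe holds 4 x 64K = 256K of data):

  echo $(( 502815488 - 502809856 ))   # 5632K of array data
  echo $(( 5632 / 256 ))              # 22 new-layout stripes to repeat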

mdadm -Af should do that BUT IT DOESN'T.  Assuming I have thought through
this properly (and I should go through it again with more care), mdadm won't
do the right thing for you.  I need to get it to handle 'reshape' specially
when doing a --force assemble.

> 
> > bernstein@server:~$ sudo mdadm --assemble /dev/md2 /dev/sda1 /dev/md0 /dev/md1 /dev/sdd1 /dev/sdc1
> >
> > mdadm: /dev/md2 assembled from 3 drives - not enough to start the array.
> > bernstein@server:~$ cat /proc/mdstat
> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> > md2 : inactive sdc1[0](S) sda1[5](S) md0[4](S) md1[3](S) sdd1[1](S)
> >       4883823704 blocks super 1.2
> >
> > md1 : active raid0 sdf1[0] sdg1[1]
> >       976770944 blocks super 1.2 64k chunks
> >
> > md0 : active raid0 sdb1[1] sdh1[0]
> >       976770944 blocks super 1.2 64k chunks
> >
> > unused devices: <none>
> 
> i do have a backup but since recovery from it takes a few days, i'd
> like to know if there is a way to recover the array or if it's
> completely lost.
> 
> Any suggestions gratefully received,

The fact that you have a backup is excellent.  You might need it, but I hope
not.

I would like to provide you with a modified version of mdadm which you can
then use to --force assemble the array.  It should be able to get you access
to all your data.
The array will be degraded and will finish the reshape in that state.  Then you
will need to add sdd1 back in (assuming you are confident that it works) and
it will be rebuilt.
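
Once the reshape has finished, re-adding sdd1 would be something like the
following - only if you are happy the drive itself is OK; whether its stale
superblock needs clearing first depends on how mdadm reacts to it:

  # only if mdadm refuses the add because of the stale metadata:
  # mdadm --zero-superblock /dev/sdd1
  mdadm /dev/md2 --add /dev/sdd1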

Just to go through some of the numbers...

Chunk size is 64K.  Reshape was 4->5, so 3 -> 4 data disks.
So old stripes have 192K, new stripes have 256K.

The 'good' disks think reshape has reached 502815488K which is
1964123 new stripes. (2618830.66 old stripes)
sdd1 thinks reshape has only reached 489510400K which is 1912150
new stripes (2549533.33 old stripes).

So of the 51973 stripes that have been reshaped since the last metadata
update on sdd1, some will have been done on sdd1, but some not, and we don't
really know how many.  But it is perfectly safe to repeat those stripes
as all writes to that region will have been suspended (and you probably
weren't writing anyway).
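
To double-check those stripe counts (reshape positions are in K; an old
4-device stripe holds 3 x 64K = 192K of data, a new 5-device stripe holds
4 x 64K = 256K):

  echo $(( 502815488 / 256 ))     # 1964123 - new stripes on the good disks
  echo $(( 489510400 / 256 ))     # 1912150 - new stripes recorded on sdd1
  echo $(( 1964123 - 1912150 ))   # 51973   - stripes reshaped since then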

So I need to change the loop in Assemble.c which calls ->update_super
with "force-one" to also make sure the reshape_position in the 'chosen'
superblock matches the oldest 'forced' superblock.

So if you are able to wait a day, I'll try to write a patch first thing
tomorrow and send it to you.

Thanks for the excellent problem report.

NeilBrown
