Re: Likely forced assembly with wrong disk during raid5 grow. Recoverable?

On Sun, Feb 20, 2011 at 06:25, NeilBrown <neilb@xxxxxxx> wrote:
> On Sun, 20 Feb 2011 04:23:09 +0100 Claude Nobs <claudenobs@xxxxxxxxx> wrote:
>
>> Hi All,
>>
>> I was wondering if someone might be willing to share if this array is
>> recoverable.
>>
>
> Probably is.  But don't do anything yet - any further action taken before you
> have read all of the following email will probably cause more harm than good.
>
>> I had a clean, running RAID 5 array using 4 block devices (two of those
>> were 2-disk RAID 0 md devices). Last night I decided it was safe to grow
>> the array by one disk. But then a) a disk failed, b) a power loss
>> occurred, c) I probably replaced the wrong disk and forced assembly,
>> resulting in an inconsistent state. Here is a complete set of actions
>> taken:
>
> Providing this level of information is excellent!
>
>
>>
>> > bernstein@server:~$ sudo mdadm --grow --raid-devices=5 --backup-file=/raid.grow.backupfile /dev/md2
>> > mdadm: Need to backup 768K of critical section..
>> > mdadm: ... critical section passed.
>> > bernstein@server:~$ cat /proc/mdstat
>> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> > md1 : active raid0 sdg1[1] sdf1[0]
>> >       976770944 blocks super 1.2 64k chunks
>> >
>> > md2 : active raid5 sda1[5] md0[4] md1[3] sdd1[1] sdc1[0]
>> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
>> >       [>....................]  reshape =  1.6% (16423164/976760640) finish=902.2min speed=17739K/sec
>> >
>> > md0 : active raid0 sdh1[0] sdb1[1]
>> >       976770944 blocks super 1.2 64k chunks
>> >
>> > unused devices: <none>
>
> All looks good so far.
>
>>
>>
>> Now I thought /dev/sdg1 had failed. Unfortunately I have no log for this
>> one, just my memory of seeing the mdstat line above change to this:
>>
>> >       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/5] [UU_UU]
>>
>
> Unfortunately it is not possible to know which drive is missing from the
> above info.  The [numbers] in brackets don't exactly correspond to the
> positions in the array that you might think they do.  The mdstat listing above
> has numbers 0,1,3,4,5.
>
> They are the 'Number' column in the --detail output below.  This is /dev/md1
> - I can tell from the --examine outputs, but it is a bit confusing.  Newer
> versions of mdadm make this a little less confusing.  If you look for
> patterns of U and u in the 'Array State' line, the U is 'this device', a
> 'u' is some other device.

Actually this is running a stock Ubuntu 10.10 server kernel. But as that
line is from memory, it could very well have been:

       2930281920 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/5] [U_UUU]

>
> So /dev/md1 had a failure, so it could well have been sdg1.
>
>
>> Some 10 minutes later a power loss occurred; thanks to a UPS the
>> server shut down cleanly, as with 'shutdown -h now'. I then exchanged
>> /dev/sdg1, rebooted and, in a lapse of judgement, forced assembly:
>
> Perfect timing :-)
>
>>
>> > bernstein@server:~$ sudo mdadm --assemble --run /dev/md2 /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1
>> > mdadm: Could not open /dev/sda1 for write - cannot Assemble array.
>> > mdadm: Failed to restore critical section for reshape, sorry.
>
> This isn't actually a 'forced assembly' as you seem to think.  There is no
> '-f' or '--force'.  It didn't cause any harm.

Phew... at last some luck! That "Failed to restore critical section
for reshape, sorry" really scared the hell out of me.
But then again it got me paying attention and stopped me from making things worse... :-)

>
>> >
>> > bernstein@server:~$ sudo mdadm --detail /dev/md2
>> > /dev/md2:
>> >         Version : 01.02
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >   Used Dev Size : 976760640 (931.51 GiB 1000.20 GB)
>> >    Raid Devices : 5
>> >   Total Devices : 3
>> > Preferred Minor : 3
>> >     Persistence : Superblock is persistent
>> >
>> >     Update Time : Sat Feb 19 22:32:04 2011
>> >           State : active, degraded, Not Started
>                                         ^^^^^^^^^^^^
>
> mdadm has put the devices together as best it can, but has not started the
> array because it didn't have enough devices.  This is good.
>
>
>> >  Active Devices : 3
>> > Working Devices : 3
>> >  Failed Devices : 0
>> >   Spare Devices : 0
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >   Delta Devices : 1, (4->5)
>> >
>> >            Name : master:public
>> >            UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >          Events : 133609
>> >
>> >     Number   Major   Minor   RaidDevice State
>> >        0       8       33        0      active sync   /dev/sdc1
>> >        1       0        0        1      removed
>> >        2       0        0        2      removed
>> >        4       9        0        3      active sync   /dev/block/9:0
>> >        5       8        1        4      active sync   /dev/sda1
>
> So you now have 2 devices missing.  As long as we can find the devices,
>   mdadm --assemble --force
> should be able to put them together for you.  But let's see what we have...
>
>>
>> So I reattached the old disk, got /dev/md1 back and did the
>> investigation I should have done before:
>>
>> > bernstein@server:~$ sudo mdadm --examine /dev/sdd1
>> > /dev/sdd1:
>> >           Magic : a92b4efc
>> >         Version : 1.2
>> >     Feature Map : 0x4
>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >            Name : master:public
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >    Raid Devices : 5
>> >
>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>> >     Data Offset : 272 sectors
>> >    Super Offset : 8 sectors
>> >           State : clean
>> >     Device UUID : 5e37fc7c:50ff3b50:de3755a1:6bdbebc6
>> >
>> >   Reshape pos'n : 489510400 (466.83 GiB 501.26 GB)
>> >   Delta Devices : 1 (4->5)
>> >
>> >     Update Time : Sat Feb 19 22:23:09 2011
>> >        Checksum : fd0c1794 - correct
>> >          Events : 133567
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >      Array Slot : 1 (0, 1, failed, 2, 3, 4)
>> >     Array State : uUuuu 1 failed
>
> This device thinks all is well.  The "1 failed" is misleading.  The
>   uUuuu
> pattern says that all the devices are thought to be working.
> Note for later reference:
>         Events : 133567
>  Reshape pos'n : 489510400
>
>
>> > bernstein@server:~$ sudo mdadm --examine /dev/sda1
>> > /dev/sda1:
>> >           Magic : a92b4efc
>> >         Version : 1.2
>> >     Feature Map : 0x4
>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >            Name : master:public
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >    Raid Devices : 5
>> >
>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>> >     Data Offset : 272 sectors
>> >    Super Offset : 8 sectors
>> >           State : clean
>> >     Device UUID : baebd175:e4128e4c:f768b60f:4df18f77
>> >
>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>> >   Delta Devices : 1 (4->5)
>> >
>> >     Update Time : Sat Feb 19 22:32:04 2011
>> >        Checksum : 12c832c6 - correct
>> >          Events : 133609
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >      Array Slot : 5 (0, failed, failed, failed, 3, 4)
>> >     Array State : u__uU 3 failed
>
> This device thinks devices 1 and 2 have failed (the '_'s).
> So 'sdd1' above, and md1.
>         Events : 133609 - this has advanced a bit from sdd1
>  Reshape Pos'n : 502815488 - this has advanced quite a lot.
>
>
>> > bernstein@server:~$ sudo mdadm --examine /dev/sdc1
>> > /dev/sdc1:
>> >           Magic : a92b4efc
>> >         Version : 1.2
>> >     Feature Map : 0x4
>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >            Name : master:public
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >    Raid Devices : 5
>> >
>> >  Avail Dev Size : 1953521392 (931.51 GiB 1000.20 GB)
>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>> >     Data Offset : 272 sectors
>> >    Super Offset : 8 sectors
>> >           State : clean
>> >     Device UUID : 82f5284a:2bffb837:19d366ab:ef2e3d94
>> >
>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>> >   Delta Devices : 1 (4->5)
>> >
>> >     Update Time : Sat Feb 19 22:32:04 2011
>> >        Checksum : 8aa7d094 - correct
>> >          Events : 133609
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >      Array Slot : 0 (0, failed, failed, failed, 3, 4)
>> >     Array State : U__uu 3 failed
>
>  Reshape pos'n, Events, and Array State are identical to sda1.
> So these two are in agreement.
>
>
>> > bernstein@server:~$ sudo mdadm --examine /dev/md0
>> > /dev/md0:
>> >           Magic : a92b4efc
>> >         Version : 1.2
>> >     Feature Map : 0x4
>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >            Name : master:public
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >    Raid Devices : 5
>> >
>> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>> >     Data Offset : 272 sectors
>> >    Super Offset : 8 sectors
>> >           State : clean
>> >     Device UUID : 83ecd60d:f3947a5e:a69c4353:3c4a0893
>> >
>> >   Reshape pos'n : 502815488 (479.52 GiB 514.88 GB)
>> >   Delta Devices : 1 (4->5)
>> >
>> >     Update Time : Sat Feb 19 22:32:04 2011
>> >        Checksum : 1bbf913b - correct
>> >          Events : 133609
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >      Array Slot : 4 (0, failed, failed, failed, 3, 4)
>> >     Array State : u__Uu 3 failed
>
> again, exactly the same as sda1 and sdc1.
>
>> > bernstein@server:~$ sudo mdadm --examine /dev/md1
>> > /dev/md1:
>> >           Magic : a92b4efc
>> >         Version : 1.2
>> >     Feature Map : 0x4
>> >      Array UUID : c3b6db19:b61c3ba9:0a74b12b:3041a523
>> >            Name : master:public
>> >   Creation Time : Sat Jan 22 00:15:43 2011
>> >      Raid Level : raid5
>> >    Raid Devices : 5
>> >
>> >  Avail Dev Size : 1953541616 (931.52 GiB 1000.21 GB)
>> >      Array Size : 7814085120 (3726.05 GiB 4000.81 GB)
>> >   Used Dev Size : 1953521280 (931.51 GiB 1000.20 GB)
>> >     Data Offset : 272 sectors
>> >    Super Offset : 8 sectors
>> >           State : clean
>> >     Device UUID : 3c7e2c3f:8b6c7c43:a0ce7e33:ad680bed
>> >
>> >   Reshape pos'n : 502809856 (479.52 GiB 514.88 GB)
>> >   Delta Devices : 1 (4->5)
>> >
>> >     Update Time : Sat Feb 19 22:30:29 2011
>> >        Checksum : 6c591e90 - correct
>> >          Events : 133603
>> >
>> >          Layout : left-symmetric
>> >      Chunk Size : 64K
>> >
>> >      Array Slot : 3 (0, failed, failed, 2, 3, 4)
>> >     Array State : u_Uuu 2 failed
>
> And here is md1.  It thinks device 2 - sdd1 - has failed.
>         Events : 133603 - slightly behind the 3 good devices, but well after sdd1
>  Reshape Pos'n : 502809856 - just a little before the 3 good devices too.
>
>>
>> so obviously it was not /dev/sdd1 that failed. however (due to that silly
>> forced assembly?!) the reshape pos'n field of md0 and sd[ac]1 differs
>> slightly from that of md1, resulting in an inconsistent state...
>
> The way I read it is:
>
> sdd1 failed first - shortly after Sat Feb 19 22:23:09 2011, the update time on sdd1.
> The reshape continued until some time between Sat Feb 19 22:30:29 2011
> and Sat Feb 19 22:32:04 2011, when md1 had a failure.
> The reshape couldn't continue now, so it stopped.
>
> So the data on sdd1 is out of date (there has been about 8 minutes of reshape
> since then) and cannot be used.
> The data on md1 is very close to the rest.  The data that was in the process
> of being relocated lives in two locations on the 'good' drives, both the new
> and the old.  It only lives in the 'old' location on md1.
>
> So what we need to do is re-assemble the array, but telling it that the
> reshape has only gone as far as md1 thinks it has.  This will make sure it
> repeats that last part of the reshape.
>
> mdadm -Af should do that BUT IT DOESN'T.  Assuming I have thought through
> this properly (and I should go through it again with more care), mdadm won't
> do the right thing for you.  I need to get it to handle 'reshape' specially
> when doing a --force assemble.

Exactly what I was thinking of doing - glad I waited and asked.

>
>>
>> > bernstein@server:~$ sudo mdadm --assemble /dev/md2 /dev/sda1 /dev/md0 /dev/md1 /dev/sdd1 /dev/sdc1
>> >
>> > mdadm: /dev/md2 assembled from 3 drives - not enough to start the array.
>> > bernstein@server:~$ cat /proc/mdstat
>> > Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>> > md2 : inactive sdc1[0](S) sda1[5](S) md0[4](S) md1[3](S) sdd1[1](S)
>> >       4883823704 blocks super 1.2
>> >
>> > md1 : active raid0 sdf1[0] sdg1[1]
>> >       976770944 blocks super 1.2 64k chunks
>> >
>> > md0 : active raid0 sdb1[1] sdh1[0]
>> >       976770944 blocks super 1.2 64k chunks
>> >
>> > unused devices: <none>
>>
>> I do have a backup, but since recovery from it takes a few days, I'd
>> like to know if there is a way to recover the array or if it's
>> completely lost.
>>
>> Any suggestions gratefully received,
>
> The fact that you have a backup is excellent.  You might need it, but I hope
> not.
>
> I would like to provide you with a modified version of mdadm which you can
> then use to --force assemble the array.  It should be able to get you access
> to all your data.
> The array will be degraded and will finish the reshape in that state.  Then you
> will need to add sdd1 back in (assuming you are confident that it works) and
> it will be rebuilt.
>
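
(Just so I understand the plan: the end result would be running something like

    sudo mdadm --assemble --force /dev/md2 /dev/sda1 /dev/sdc1 /dev/md0 /dev/md1

i.e. leaving sdd1 out, letting the reshape finish degraded, and afterwards

    sudo mdadm /dev/md2 --add /dev/sdd1

to rebuild onto it? That is just my guess from your description, so please
correct me if I got the commands or the device list wrong.)
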
> Just to go through some of the numbers...
>
> Chunk size is 64K.  Reshape was 4->5, so 3 -> 4 data disks.
> So old stripes have 192K, new stripes have 256K.
>
> The 'good' disks think reshape has reached 502815488K which is
> 1964123 new stripes. (2618830.66 old stripes)
> md1 thinks reshape has only reached 489510400K which is 1912150
> new stripes (2549533.33 old stripes).

I think you mixed up sdd1 with md1 here? (The numbers above labelled md1 are
actually for sdd1. md1 would be: reshape has reached 502809856K, which is
1964101 new stripes, so the difference between the good disks and md1 is
22 stripes.)
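
(For reference, here is how I double-checked those numbers with plain shell
arithmetic - one new stripe holds 4 x 64K = 256K of data, and the reshape
position is given in K:

    $ echo $(( 502815488 / 256 ))   # good disks (sda1, sdc1, md0)
    1964123
    $ echo $(( 502809856 / 256 ))   # md1
    1964101
    $ echo $(( 489510400 / 256 ))   # sdd1
    1912150

so md1 is 22 new stripes behind the good disks, and sdd1 is 51973 behind.)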

>
> So of the 51973 stripes that have been reshaped since the last metadata
> update on sdd1, some will have been done on sdd1, but some not, and we don't
> really know how many.  But it is perfectly safe to repeat those stripes
> as all writes to that region will have been suspended (and you probably
> weren't writing anyway).

Yep, there was nothing writing to the array. But now I am a little
confused: if you meant sdd1 (which failed first and is 51973 stripes
behind), that would imply that at least that many stripes of data are
kept in the old (3 data disk) layout as well as the new one? And if
continuing from there were possible, the array would no longer be
degraded, right? So I think you meant md1 (22 stripes behind), as
keeping 5.5M of data in both the old and the new layout seems more
reasonable. But this is just a guess :-)

>
> So I need to change the loop in Assemble.c which calls ->update_super
> with "force-one" to also make sure the reshape_position in the 'chosen'
> superblock matches the oldest 'forced' superblock.

Uh... ah... probably - I have zero knowledge of kernel code :-)
I guess it should take into account that the oldest superblock (sdd1
in this case) may already be out of the section where the data (in the
old layout) still exists? But I guess you already thought of that...

>
> So if you are able to wait a day, I'll try to write a patch first thing
> tomorrow and send it to you.

Sure, that would be awesome! That boils down to compiling the patched
kernel, doesn't it? This will probably take a few days, as the system is
quite slow and I'd have to get up to speed with kernel compiling, but it
shouldn't be a problem. Would I have to patch the Ubuntu kernel (based
on 2.6.35.4) or the latest 2.6.38-rc from kernel.org?
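
(Or, if the change turns out to be in mdadm itself rather than in the kernel,
I guess it would just be a matter of something like

    git clone git://neil.brown.name/mdadm   # guessing the URL here
    cd mdadm
    patch -p1 < reshape-force.patch         # whatever the patch ends up being called
    make

which I could manage much faster. The URL and the patch file name above are
just placeholders on my part, so please tell me what I would actually need to build.)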

>
> Thanks for the excellent problem report.
>
> NeilBrown

Well, thank you for providing such an elaborate and friendly answer!
This is actually my first mailing list post, and considering how many
questions get ignored (I don't know about this list though) I just hoped
someone would at least answer with a one-liner... I never expected
this. So thanks again.

Claude
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

