help please - recovering from failed RAID5 rebuild

Hi all,
I need some advice on recovering from a catastrophic RAID5 failure
(and a possible f***-up on my part in trying to recover). I didn't
find this covered in any of the howtos out there, so here is my
problem:
I was trying to rebuild my RAID5 (5 drives) after I had replaced one
failed drive (sdc) with a fresh one. During the rebuild another drive
(sdd) failed, so I was left with sdc as a spare (S) and sdd as failed
(F).

First question: what should I have done at this point? I'm fairly
certain that the error that failed sdd was a fluke (a loose cable or
something), so I wanted mdadm to just assume it was clean and retry
the rebuild.
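My best guess, from reading the mdadm man page after the fact, is that
I should have stopped the array and forced an assembly so that the
slightly out-of-date sdd would be accepted again despite its lower
event count, roughly like this (the device list is my own assumption
of the four in-sync members plus sdd, leaving the fresh sdc aside, so
please correct me if --force is not the right tool here):

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1

But I didn't know that at the time, and I'd rather have confirmation
before trusting it with the only copy of the data.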

What I actually did was reboot the machine in the hope that it would
just restart the array in the previous degraded state, which of course
it did not. Instead, all drives except the failed sdd were reported as
spare (S) in /proc/mdstat.

So I tried:

root@enterprise:~# mdadm --run /dev/md0
mdadm: failed to run array /dev/md0: Input/output error

Syslog showed at this point:

Apr  7 20:37:49 localhost kernel: [  893.981851] md: kicking non-fresh sdd1 from array!
Apr  7 20:37:49 localhost kernel: [  893.981864] md: unbind<sdd1>
Apr  7 20:37:49 localhost kernel: [  893.992526] md: export_rdev(sdd1)
Apr  7 20:37:49 localhost kernel: [  893.995844] raid5: device sdb1 operational as raid disk 3
Apr  7 20:37:49 localhost kernel: [  893.995848] raid5: device sdf1 operational as raid disk 4
Apr  7 20:37:49 localhost kernel: [  893.995852] raid5: device sde1 operational as raid disk 2
Apr  7 20:37:49 localhost kernel: [  893.996353] raid5: allocated 5265kB for md0
Apr  7 20:37:49 localhost kernel: [  893.996478] 3: w=1 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996482] 4: w=2 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996485] 2: w=3 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:37:49 localhost kernel: [  893.996488] raid5: not enough operational devices for md0 (2/5 failed)
Apr  7 20:37:49 localhost kernel: [  893.996514] RAID5 conf printout:
Apr  7 20:37:49 localhost kernel: [  893.996517]  --- rd:5 wd:3
Apr  7 20:37:49 localhost kernel: [  893.996520]  disk 2, o:1, dev:sde1
Apr  7 20:37:49 localhost kernel: [  893.996522]  disk 3, o:1, dev:sdb1
Apr  7 20:37:49 localhost kernel: [  893.996525]  disk 4, o:1, dev:sdf1
Apr  7 20:37:49 localhost kernel: [  893.996898] raid5: failed to run raid set md0
Apr  7 20:37:49 localhost kernel: [  893.996901] md: pers->run() failed ...

So I figured I'd re-add sdd:

root@enterprise:~# mdadm --re-add /dev/md0 /dev/sdd1
mdadm: re-added /dev/sdd1
root@enterprise:~# mdadm --run /dev/md0
mdadm: started /dev/md0

Apr  7 20:44:16 localhost kernel: [ 1281.139654] md: bind<sdd1>
Apr  7 20:44:16 localhost mdadm[1523]: SpareActive event detected on md device /dev/md0, component device /dev/sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.147581] raid5: device sdd1 operational as raid disk 0
Apr  7 20:44:32 localhost kernel: [ 1297.147585] raid5: device sdb1 operational as raid disk 3
Apr  7 20:44:32 localhost kernel: [ 1297.147588] raid5: device sdf1 operational as raid disk 4
Apr  7 20:44:32 localhost kernel: [ 1297.147591] raid5: device sde1 operational as raid disk 2
Apr  7 20:44:32 localhost kernel: [ 1297.148102] raid5: allocated 5265kB for md0
Apr  7 20:44:32 localhost kernel: [ 1297.148704] 0: w=1 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148708] 3: w=2 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148712] 4: w=3 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148715] 2: w=4 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
Apr  7 20:44:32 localhost kernel: [ 1297.148718] raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
Apr  7 20:44:32 localhost kernel: [ 1297.148722] RAID5 conf printout:
Apr  7 20:44:32 localhost kernel: [ 1297.148725]  --- rd:5 wd:4
Apr  7 20:44:32 localhost kernel: [ 1297.148728]  disk 0, o:1, dev:sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.148731]  disk 2, o:1, dev:sde1
Apr  7 20:44:32 localhost kernel: [ 1297.148734]  disk 3, o:1, dev:sdb1
Apr  7 20:44:32 localhost kernel: [ 1297.148737]  disk 4, o:1, dev:sdf1
Apr  7 20:44:32 localhost kernel: [ 1297.148779] md0: detected capacity change from 0 to 6001196531712
Apr  7 20:44:32 localhost kernel: [ 1297.149047]  md0:RAID5 conf printout:
Apr  7 20:44:32 localhost kernel: [ 1297.149559]  --- rd:5 wd:4
Apr  7 20:44:32 localhost kernel: [ 1297.149562]  disk 0, o:1, dev:sdd1
Apr  7 20:44:32 localhost kernel: [ 1297.149565]  disk 1, o:1, dev:sdc1
Apr  7 20:44:32 localhost kernel: [ 1297.149568]  disk 2, o:1, dev:sde1
Apr  7 20:44:32 localhost kernel: [ 1297.149570]  disk 3, o:1, dev:sdb1
Apr  7 20:44:32 localhost kernel: [ 1297.149573]  disk 4, o:1, dev:sdf1
Apr  7 20:44:32 localhost kernel: [ 1297.149846] md: recovery of RAID array md0
Apr  7 20:44:32 localhost kernel: [ 1297.149849] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Apr  7 20:44:32 localhost kernel: [ 1297.149852] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Apr  7 20:44:32 localhost kernel: [ 1297.149858] md: using 128k window, over a total of 1465135872 blocks.
Apr  7 20:44:32 localhost kernel: [ 1297.188272]  unknown partition table
Apr  7 20:44:32 localhost mdadm[1523]: RebuildStarted event detected on md device /dev/md0

I figured this was definitely wrong, since I still couldn't mount
/dev/md0, so I manually failed sdc and sdd to stop any further
destruction on my part and went to seek expert help, which is why I'm
here. Is my data still there, or have the first few hundred MB been
zeroed to initialize a fresh array? How do I force mdadm to assume sdd
is fresh and give me access to the array without any writes happening
to it? (My rough guess at what that might look like is below.) Sorry
to come running to the highest authority on linux-raid here, but all
the howtos out there are pretty thin when it comes to anything more
complicated than creating an array and recovering from one failed
drive.
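To make that question concrete, what I'm hoping the answer looks
roughly like is the following (the flag combination is only my guess
from the man page; in particular I'm not sure --readonly can be
combined with --assemble, so please correct me):

mdadm --stop /dev/md0
# assemble only the four original members, leaving the half-rebuilt sdc out
mdadm --assemble --force --readonly /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1
mount -o ro /dev/md0 /mnt

i.e. bring the four old members up read-only, check that the
filesystem is still intact, and only then re-add sdc and let the
rebuild run.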

Any advice (even if it is just how to do the right thing next time) is
greatly appreciated.
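
One more thing: if the current superblock state of the members would
help with the diagnosis, I assume I can still collect the event counts
non-destructively with something like

mdadm --examine /dev/sd[b-f]1 | grep -E '/dev/sd|Events'

(the glob is just my shorthand for the five member partitions), and
I'll gladly post the full --examine output.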

-Felix

