On Thu, 7 Apr 2011 22:00:58 +0200 sanktnelson 1
<sanktnelson@xxxxxxxxxxxxxx> wrote:

> Hi all,
>
> I need some advice on recovering from a catastrophic RAID5 failure
> (and a possible f***-up on my part in trying to recover). I didn't
> find this covered in any of the howtos out there, so here is my
> problem:
>
> I was trying to rebuild my RAID5 (5 drives) after I had replaced one
> failed drive (sdc) with a fresh one. During the rebuild another drive
> (sdd) failed, so I was left with sdc as spare (S) and sdd failed (F).
>
> First question: what should I have done at this point? I'm fairly
> certain that the error that failed sdd was a fluke, a loose cable or
> something, so I wanted mdadm to just assume it was clean and retry
> the rebuild.

What I would have done is stop the array (mdadm -S /dev/md0) and
re-assemble it with --force.  That would get you the degraded array
back.

Then I would back up any data that I really couldn't live without -
just in case.

Then I would stop the array, ddrescue sdd to some other device
(possibly sdc), and assemble the array from the known-good devices
plus the new copy - and NOT sdd.  That would give me a degraded array
of good devices.

Then I would add another good device - maybe even sdd, if I thought
its failure was just a read error and writes would still work.  Then
wait and hope.

> What I actually did was reboot the machine in the hope it would just
> restart the array in the previous degraded state, which of course it
> did not. Instead all drives except the failed sdd were reported as
> spare (S) in /proc/mdstat.
>
> So I tried:
>
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: failed to run array /dev/md0: Input/output error
>
> syslog showed at this point:
>
> Apr 7 20:37:49 localhost kernel: [ 893.981851] md: kicking non-fresh sdd1 from array!
> Apr 7 20:37:49 localhost kernel: [ 893.981864] md: unbind<sdd1>
> Apr 7 20:37:49 localhost kernel: [ 893.992526] md: export_rdev(sdd1)
> Apr 7 20:37:49 localhost kernel: [ 893.995844] raid5: device sdb1 operational as raid disk 3
> Apr 7 20:37:49 localhost kernel: [ 893.995848] raid5: device sdf1 operational as raid disk 4
> Apr 7 20:37:49 localhost kernel: [ 893.995852] raid5: device sde1 operational as raid disk 2
> Apr 7 20:37:49 localhost kernel: [ 893.996353] raid5: allocated 5265kB for md0
> Apr 7 20:37:49 localhost kernel: [ 893.996478] 3: w=1 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:37:49 localhost kernel: [ 893.996482] 4: w=2 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:37:49 localhost kernel: [ 893.996485] 2: w=3 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:37:49 localhost kernel: [ 893.996488] raid5: not enough operational devices for md0 (2/5 failed)
> Apr 7 20:37:49 localhost kernel: [ 893.996514] RAID5 conf printout:
> Apr 7 20:37:49 localhost kernel: [ 893.996517]  --- rd:5 wd:3
> Apr 7 20:37:49 localhost kernel: [ 893.996520]  disk 2, o:1, dev:sde1
> Apr 7 20:37:49 localhost kernel: [ 893.996522]  disk 3, o:1, dev:sdb1
> Apr 7 20:37:49 localhost kernel: [ 893.996525]  disk 4, o:1, dev:sdf1
> Apr 7 20:37:49 localhost kernel: [ 893.996898] raid5: failed to run raid set md0
> Apr 7 20:37:49 localhost kernel: [ 893.996901] md: pers->run() failed ...
>
> So I figured I'd re-add sdd:
>
> root@enterprise:~# mdadm --re-add /dev/md0 /dev/sdd1
> mdadm: re-added /dev/sdd1
> root@enterprise:~# mdadm --run /dev/md0
> mdadm: started /dev/md0

This should be effectively equivalent to --assemble --force (I think).
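Spelled out, the sequence I described above would look something like
this (a sketch only - the device names are taken from your report, and
sdX1 at the end is a stand-in for whatever good disk you actually use):

  # 1. Stop the array so nothing else touches the superblocks.
  mdadm -S /dev/md0

  # 2. Force-assemble the degraded array from the members that were
  #    still in sync, including the suspect sdd1.
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1

  # 3. Back up anything irreplaceable, then stop the array again.
  mdadm -S /dev/md0

  # 4. Copy the suspect drive onto a good one (GNU ddrescue syntax;
  #    the log file lets the copy resume after read errors).
  ddrescue /dev/sdd1 /dev/sdc1 sdd1-rescue.log

  # 5. Assemble from the known-good members plus the copy - NOT sdd1.
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sde1 /dev/sdf1

  # 6. Add a replacement device (placeholder name) and let it rebuild.
  mdadm /dev/md0 --add /dev/sdX1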
> Apr 7 20:44:16 localhost kernel: [ 1281.139654] md: bind<sdd1>
> Apr 7 20:44:16 localhost mdadm[1523]: SpareActive event detected on md device /dev/md0, component device /dev/sdd1
> Apr 7 20:44:32 localhost kernel: [ 1297.147581] raid5: device sdd1 operational as raid disk 0
> Apr 7 20:44:32 localhost kernel: [ 1297.147585] raid5: device sdb1 operational as raid disk 3
> Apr 7 20:44:32 localhost kernel: [ 1297.147588] raid5: device sdf1 operational as raid disk 4
> Apr 7 20:44:32 localhost kernel: [ 1297.147591] raid5: device sde1 operational as raid disk 2
> Apr 7 20:44:32 localhost kernel: [ 1297.148102] raid5: allocated 5265kB for md0
> Apr 7 20:44:32 localhost kernel: [ 1297.148704] 0: w=1 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:44:32 localhost kernel: [ 1297.148708] 3: w=2 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:44:32 localhost kernel: [ 1297.148712] 4: w=3 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:44:32 localhost kernel: [ 1297.148715] 2: w=4 pa=0 pr=5 m=1 a=2 r=5 op1=0 op2=0
> Apr 7 20:44:32 localhost kernel: [ 1297.148718] raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
> Apr 7 20:44:32 localhost kernel: [ 1297.148722] RAID5 conf printout:
> Apr 7 20:44:32 localhost kernel: [ 1297.148725]  --- rd:5 wd:4
> Apr 7 20:44:32 localhost kernel: [ 1297.148728]  disk 0, o:1, dev:sdd1
> Apr 7 20:44:32 localhost kernel: [ 1297.148731]  disk 2, o:1, dev:sde1
> Apr 7 20:44:32 localhost kernel: [ 1297.148734]  disk 3, o:1, dev:sdb1
> Apr 7 20:44:32 localhost kernel: [ 1297.148737]  disk 4, o:1, dev:sdf1
> Apr 7 20:44:32 localhost kernel: [ 1297.148779] md0: detected capacity change from 0 to 6001196531712
> Apr 7 20:44:32 localhost kernel: [ 1297.149047] md0:RAID5 conf printout:
> Apr 7 20:44:32 localhost kernel: [ 1297.149559]  --- rd:5 wd:4
> Apr 7 20:44:32 localhost kernel: [ 1297.149562]  disk 0, o:1, dev:sdd1
> Apr 7 20:44:32 localhost kernel: [ 1297.149565]  disk 1, o:1, dev:sdc1
> Apr 7 20:44:32 localhost kernel: [ 1297.149568]  disk 2, o:1, dev:sde1
> Apr 7 20:44:32 localhost kernel: [ 1297.149570]  disk 3, o:1, dev:sdb1
> Apr 7 20:44:32 localhost kernel: [ 1297.149573]  disk 4, o:1, dev:sdf1
> Apr 7 20:44:32 localhost kernel: [ 1297.149846] md: recovery of RAID array md0
> Apr 7 20:44:32 localhost kernel: [ 1297.149849] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Apr 7 20:44:32 localhost kernel: [ 1297.149852] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Apr 7 20:44:32 localhost kernel: [ 1297.149858] md: using 128k window, over a total of 1465135872 blocks.
> Apr 7 20:44:32 localhost kernel: [ 1297.188272] unknown partition table
> Apr 7 20:44:32 localhost mdadm[1523]: RebuildStarted event detected on md device /dev/md0
>
> I figured this was definitely wrong, since I still couldn't mount
> /dev/md0, so I manually failed sdc and sdd to stop any further
> destruction on my part and to go seek expert help, so here I am. Is my
> data still there, or have the first few hundred MB been zeroed to
> initialize a fresh array? How do I force mdadm to assume sdd is fresh
> and give me access to the array without any writes happening to it?
>
> Sorry to come running to the highest authority on linux-raid here,
> but all the howtos out there are pretty thin when it comes to
> anything more complicated than creating an array and recovering from
> one failed drive.

Well, I wouldn't have failed the devices.  I would simply have stopped
the array:

  mdadm -S /dev/md0
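As for getting at the array without any writes happening: you can tell
md to bring arrays up "auto-read-only", so nothing is written to the
members until you explicitly allow it.  Roughly (a sketch - the device
list assumes sdb1, sdd1, sde1 and sdf1 are the members you want, with
the half-built spare sdc1 left out):

  # Make newly started arrays come up auto-read-only.
  echo 1 > /sys/module/md_mod/parameters/start_ro

  # Force-assemble from the old members.
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdd1 /dev/sde1 /dev/sdf1

  # md0 should show as (auto-read-only) until something writes to it.
  cat /proc/mdstat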
But I'm very surprised that the re-add and restart didn't leave you
with a mountable array: md and mdadm never write zeros to initialise
anything (except a bitmap), so the data itself should still be there.

> Any advice (even if it is just for doing the right thing the next
> time), is greatly appreciated.
>
> -Felix

Maybe the best thing to do at this point is post the output of

  mdadm -E /dev/sd[bcdef]1

and I'll see if I can make sense of it.
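If you want to eyeball it yourself first, the interesting fields are
the event counts and update times, which show how far out of date each
superblock is - something like this (just a convenience loop):

  for d in /dev/sd[bcdef]1; do
      echo "== $d =="
      mdadm -E $d | egrep 'Update Time|Events|State'
  done

NeilBrown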