Thanks for taking a look, David.

Kernel: 2.6.15-27-k7, stock for Ubuntu 6.06 LTS
mdadm: mdadm - v1.12.0 - 14 June 2005

You're right, earlier in /var/log/messages there's a notice that hdg
dropped - I missed it before. I do use mdadm --monitor, but I recently
changed the target email address and I guess the change didn't take
properly; I'll re-test it (sketch below the quoted message).

As for replacing hdc: thanks for the diagnosis, but it won't help. The
drive is actually fine, as is hdg. I've replaced hdc before, only to
have the brand-new hdc show the same behaviour, and SMART says the
drive is A-OK. There's something flaky about these PCI IDE controllers.
I think it's time for a new system.

Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a
filesystem superblock. Is --rebuild-sb the way to go here? I've sketched
what I'm planning to try below the quoted log.

Thanks,
Neil

On Nov 14, 2007 5:58 AM, David Greaves <david@xxxxxxxxxxxx> wrote:
> Neil Cavan wrote:
> > Hello,
> Hi Neil
>
> What kernel version?
> What mdadm version?
>
> > This morning, I woke up to find the array had kicked two disks. This
> > time, though, /proc/mdstat showed one of the failed disks (U_U_U, one
> > of the "_"s) had been marked as a spare - weird, since there are no
> > spare drives in this array. I rebooted, and the array came back in the
> > same state: one failed, one spare. I hot-removed and hot-added the
> > spare drive, which put the array back to where I thought it should be
> > ( still U_U_U, but with both "_"s marked as failed). Then I rebooted,
> > and the array began rebuilding on its own. Usually I have to hot-add
> > manually, so that struck me as a little odd, but I gave it no mind and
> > went to work. Without checking the contents of the filesystem. Which
> > turned out not to have been mounted on reboot.
> OK
>
> > Because apparently things went horribly wrong.
> Yep :(
>
> > Do I have any hope of recovering this data? Could rebuilding the
> > reiserfs superblock help if the rebuild managed to corrupt the
> > superblock but not the data?
> See below
>
> >
> > Nov 13 02:01:03 localhost kernel: [17805772.424000] hdc: dma_intr:
> > status=0x51 { DriveReady SeekComplete Error }
> <snip>
> > Nov 13 02:01:06 localhost kernel: [17805775.156000] lost page write
> > due to I/O error on md0
> hdc1 fails
>
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] RAID5 conf printout:
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] --- rd:5 wd:3 fd:2
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] disk 0, o:1, dev:hda1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] disk 1, o:0, dev:hdc1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] disk 2, o:1, dev:hde1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] disk 4, o:1, dev:hdi1
>
> hdg1 is already missing?
>
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] RAID5 conf printout:
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] --- rd:5 wd:3 fd:2
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] disk 0, o:1, dev:hda1
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] disk 2, o:1, dev:hde1
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] disk 4, o:1, dev:hdi1
>
> so now the array is bad.
>
> a reboot happens and:
>
> > Nov 13 07:21:07 localhost kernel: [17179584.712000] md: md0 stopped.
> > Nov 13 07:21:07 localhost kernel: [17179584.876000] md: bind<hdc1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hde1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdg1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdi1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: bind<hda1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: kicking
> > non-fresh hdg1 from array!
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: unbind<hdg1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: export_rdev(hdg1)
> > Nov 13 07:21:07 localhost kernel: [17179584.896000] raid5: allocated
> > 5245kB for md0
> ... apparently hdc1 is OK? Hmmm.
>
> > Nov 13 07:21:07 localhost kernel: [17179665.524000] ReiserFS: md0:
> > found reiserfs format "3.6" with standard journal
> > Nov 13 07:21:07 localhost kernel: [17179676.136000] ReiserFS: md0:
> > using ordered data mode
> > Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> > journal params: device md0, size 8192, journal first block 18, max
> > trans len 1024, max batch 900, max commit age 30, max trans age 30
> > Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> > checking transaction log (md0)
> > Nov 13 07:21:07 localhost kernel: [17179676.828000] ReiserFS: md0:
> > replayed 7 transactions in 1 seconds
> > Nov 13 07:21:07 localhost kernel: [17179677.012000] ReiserFS: md0:
> > Using r5 hash to sort names
> > Nov 13 07:21:09 localhost kernel: [17179682.064000] lost page write
> > due to I/O error on md0
> Reiser tries to mount/replay itself relying on hdc1 (which is partly bad)
>
> > Nov 13 07:25:39 localhost kernel: [17179584.828000] md: raid5
> > personality registered as nr 4
> > Nov 13 07:25:39 localhost kernel: [17179585.708000] md: kicking
> > non-fresh hdg1 from array!
> Another reboot...
>
> > Nov 13 07:25:40 localhost kernel: [17179666.064000] ReiserFS: md0:
> > found reiserfs format "3.6" with standard journal
> > Nov 13 07:25:40 localhost kernel: [17179676.904000] ReiserFS: md0:
> > using ordered data mode
> > Nov 13 07:25:40 localhost kernel: [17179676.928000] ReiserFS: md0:
> > journal params: device md0, size 8192, journal first block 18, max
> > trans len 1024, max batch 900, max commit age 30, max trans age 30
> > Nov 13 07:25:40 localhost kernel: [17179676.932000] ReiserFS: md0:
> > checking transaction log (md0)
> > Nov 13 07:25:40 localhost kernel: [17179677.080000] ReiserFS: md0:
> > Using r5 hash to sort names
> > Nov 13 07:25:42 localhost kernel: [17179683.128000] lost page write
> > due to I/O error on md0
> Reiser tries again...
>
> > Nov 13 07:26:57 localhost kernel: [17179757.524000] md: unbind<hdc1>
> > Nov 13 07:26:57 localhost kernel: [17179757.524000] md: export_rdev(hdc1)
> > Nov 13 07:27:03 localhost kernel: [17179763.700000] md: bind<hdc1>
> > Nov 13 07:30:24 localhost kernel: [17179584.180000] md: md driver
> hdc is kicked too (again)
>
> > Nov 13 07:30:24 localhost kernel: [17179584.184000] md: raid5
> > personality registered as nr 4
> Another reboot...
>
> > Nov 13 07:30:24 localhost kernel: [17179585.068000] md: syncing RAID array md0
> Now (I guess) hdg is being restored using hdc data:
>
> > Nov 13 07:30:24 localhost kernel: [17179684.160000] ReiserFS: md0:
> > warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md0
> But Reiser is confused.
>
> > Nov 13 08:57:11 localhost kernel: [17184895.816000] md: md0: sync done.
> hdg is back up to speed:
>
> So hdc looks faulty.
> Your only hope (IMO) is to use reiserfs recovery tools.
> You may want to replace hdc to avoid an hdc failure interrupting any rebuild.
>
> I think what happened is that hdg failed prior to 2am and you didn't notice
> (mdadm --monitor is your friend). Then hdc had a real failure - at that point
> you had data loss (not enough good disks). I don't know why md rebuilt using hdc
> - I would expect it to have found hdc and hdg stale. If this is a newish kernel
> then maybe Neil should take a look...
>
> David
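PS - to make sure the monitoring mail really works this time, my plan is to
double-check the address in mdadm.conf and then fire a test alert. Rough
sketch of what I mean (the address below is just a placeholder, and I'm
assuming this mdadm version already supports --oneshot/--test in monitor
mode):

    # /etc/mdadm/mdadm.conf - confirm the new address actually made it in:
    #   MAILADDR neil@example.com

    # send a TestMessage alert for each array found, then exit
    mdadm --monitor --scan --oneshot --test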
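Before touching anything else I'll also dump the member superblocks, so
there's at least a record of which disks md thought were stale - something
along these lines (device names as on my box):

    # per-member superblock info; the Events counters show who fell behind
    mdadm --examine /dev/hda1 /dev/hdc1 /dev/hde1 /dev/hdg1 /dev/hdi1

    # overall array view for the archives
    mdadm --detail /dev/md0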
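And on the reiserfs side, here's roughly what I had in mind: a read-only
check first, --rebuild-sb only if the superblock alone is damaged, and
--rebuild-tree strictly as a last resort (and only after imaging md0
elsewhere with dd). Please shout if this is the wrong order:

    # read-only pass to see what reiserfsck makes of the filesystem
    reiserfsck --check /dev/md0

    # if only the superblock is gone, try regenerating it
    reiserfsck --rebuild-sb /dev/md0

    # last resort: rebuild the whole tree (risky if the underlying data is
    # bad, so only against a dd image or with a raw backup in hand)
    reiserfsck --rebuild-tree /dev/md0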