Fwd: RAID5 Recovery

Thanks for taking a look, David.

Kernel:
2.6.15-27-k7, stock for Ubuntu 6.06 LTS

mdadm:
mdadm - v1.12.0 - 14 June 2005

You're right: earlier in /var/log/messages there's a notice that hdg
dropped, which I missed before. I do use mdadm --monitor, but I recently
changed the target email address, and I guess the change didn't take
properly.
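
For the record, here's roughly how I'm re-checking the alert path now.
The config path is the Debian/Ubuntu one, the address is a placeholder,
and I'm assuming this mdadm is new enough to know --test:

  # /etc/mdadm/mdadm.conf
  MAILADDR someone@example.com

  # one-shot monitor pass that generates a TestMessage per array,
  # so I can confirm mail actually gets delivered
  mdadm --monitor --scan --oneshot --test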

As for replacing hdc: thanks for the diagnosis, but it won't help. The
drive is actually fine, as is hdg. I've replaced hdc before, only to
have the brand-new hdc show the same behaviour, and SMART says the
drive is A-OK. There's something flaky about these PCI IDE
controllers - I think it's time for a new system.
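
(By "SMART says A-OK" I mean the usual smartmontools checks, something
like:

  smartctl -H /dev/hdc   # overall health self-assessment
  smartctl -a /dev/hdc   # full attributes and error log, worth reading beyond PASSED

and nothing in that output pointed at a dying drive.)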

Reiserfs recovery-wise: any suggestions? A simple fsck doesn't find a
file system superblock. Is --rebuild-sb the way to go here?
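
To be concrete, this is the sequence I have in mind from the reiserfsck
man page - ideally run against a dd image of md0 rather than the live
array:

  reiserfsck --check /dev/md0        # read-only pass; right now it can't find a superblock
  reiserfsck --rebuild-sb /dev/md0   # rebuild the superblock (asks about format and block size)
  reiserfsck --check /dev/md0        # re-check; if the tree is damaged, --rebuild-tree would be next

Does that look sane, or is there a better order of operations?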

Thanks,
Neil


On Nov 14, 2007 5:58 AM, David Greaves <david@xxxxxxxxxxxx> wrote:
> Neil Cavan wrote:
> > Hello,
> Hi Neil
>
> What kernel version?
> What mdadm version?
>
> > This morning, I woke up to find the array had kicked two disks. This
> > time, though, /proc/mdstat showed one of the failed disks (U_U_U, one
> > of the "_"s) had been marked as a spare - weird, since there are no
> > spare drives in this array. I rebooted, and the array came back in the
> > same state: one failed, one spare. I hot-removed and hot-added the
> > spare drive, which put the array back to where I thought it should be
> > ( still U_U_U, but with both "_"s marked as failed). Then I rebooted,
> > and the array began rebuilding on its own. Usually I have to hot-add
> > manually, so that struck me as a little odd, but I gave it no mind and
> > went to work. Without checking the contents of the filesystem. Which
> > turned out not to have been mounted on reboot.
> OK
>
> > Because apparently things went horribly wrong.
> Yep :(
>
> > Do I have any hope of recovering this data? Could rebuilding the
> > reiserfs superblock help if the rebuild managed to corrupt the
> > superblock but not the data?
> See below
>
>
>
> > Nov 13 02:01:03 localhost kernel: [17805772.424000] hdc: dma_intr:
> > status=0x51 { DriveReady SeekComplete Error }
> <snip>
> > Nov 13 02:01:06 localhost kernel: [17805775.156000] lost page write
> > due to I/O error on md0
> hdc1 fails
>
>
> > Nov 13 02:01:06 localhost kernel: [17805775.196000] RAID5 conf printout:
> > Nov 13 02:01:06 localhost kernel: [17805775.196000]  --- rd:5 wd:3 fd:2
> > Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 0, o:1, dev:hda1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 1, o:0, dev:hdc1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 2, o:1, dev:hde1
> > Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 4, o:1, dev:hdi1
>
> hdg1 is already missing?
>
> > Nov 13 02:01:06 localhost kernel: [17805775.212000] RAID5 conf printout:
> > Nov 13 02:01:06 localhost kernel: [17805775.212000]  --- rd:5 wd:3 fd:2
> > Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 0, o:1, dev:hda1
> > Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 2, o:1, dev:hde1
> > Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 4, o:1, dev:hdi1
>
> so now the array is bad.
>
> a reboot happens and:
> > Nov 13 07:21:07 localhost kernel: [17179584.712000] md: md0 stopped.
> > Nov 13 07:21:07 localhost kernel: [17179584.876000] md: bind<hdc1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hde1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdg1>
> > Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdi1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: bind<hda1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: kicking
> > non-fresh hdg1 from array!
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: unbind<hdg1>
> > Nov 13 07:21:07 localhost kernel: [17179584.892000] md: export_rdev(hdg1)
> > Nov 13 07:21:07 localhost kernel: [17179584.896000] raid5: allocated
> > 5245kB for md0
> ... apparently hdc1 is OK? Hmmm.
>
> > Nov 13 07:21:07 localhost kernel: [17179665.524000] ReiserFS: md0:
> > found reiserfs format "3.6" with standard journal
> > Nov 13 07:21:07 localhost kernel: [17179676.136000] ReiserFS: md0:
> > using ordered data mode
> > Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> > journal params: device md0, size 8192, journal first block 18, max
> > trans len 1024, max batch 900, max commit age 30, max trans age 30
> > Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> > checking transaction log (md0)
> > Nov 13 07:21:07 localhost kernel: [17179676.828000] ReiserFS: md0:
> > replayed 7 transactions in 1 seconds
> > Nov 13 07:21:07 localhost kernel: [17179677.012000] ReiserFS: md0:
> > Using r5 hash to sort names
> > Nov 13 07:21:09 localhost kernel: [17179682.064000] lost page write
> > due to I/O error on md0
> Reiser tries to mount/replay itself relying on hdc1 (which is partly bad)
>
> > Nov 13 07:25:39 localhost kernel: [17179584.828000] md: raid5
> > personality registered as nr 4
> > Nov 13 07:25:39 localhost kernel: [17179585.708000] md: kicking
> > non-fresh hdg1 from array!
> Another reboot...
>
> > Nov 13 07:25:40 localhost kernel: [17179666.064000] ReiserFS: md0:
> > found reiserfs format "3.6" with standard journal
> > Nov 13 07:25:40 localhost kernel: [17179676.904000] ReiserFS: md0:
> > using ordered data mode
> > Nov 13 07:25:40 localhost kernel: [17179676.928000] ReiserFS: md0:
> > journal params: device md0, size 8192, journal first block 18, max
> > trans len 1024, max batch 900, max commit age 30, max trans age 30
> > Nov 13 07:25:40 localhost kernel: [17179676.932000] ReiserFS: md0:
> > checking transaction log (md0)
> > Nov 13 07:25:40 localhost kernel: [17179677.080000] ReiserFS: md0:
> > Using r5 hash to sort names
> > Nov 13 07:25:42 localhost kernel: [17179683.128000] lost page write
> > due to I/O error on md0
> Reiser tries again...
>
> > Nov 13 07:26:57 localhost kernel: [17179757.524000] md: unbind<hdc1>
> > Nov 13 07:26:57 localhost kernel: [17179757.524000] md: export_rdev(hdc1)
> > Nov 13 07:27:03 localhost kernel: [17179763.700000] md: bind<hdc1>
> > Nov 13 07:30:24 localhost kernel: [17179584.180000] md: md driver
> hdc is kicked too (again)
>
> > Nov 13 07:30:24 localhost kernel: [17179584.184000] md: raid5
> > personality registered as nr 4
> Another reboot...
>
> > Nov 13 07:30:24 localhost kernel: [17179585.068000] md: syncing RAID array md0
> Now (I guess) hdg is being restored using hdc data:
>
> > Nov 13 07:30:24 localhost kernel: [17179684.160000] ReiserFS: md0:
> > warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md0
> But Reiser is confused.
>
> > Nov 13 08:57:11 localhost kernel: [17184895.816000] md: md0: sync done.
> hdg is back up to speed:
>
>
> So hdc looks faulty.
> Your only hope (IMO) is to use reiserfs recovery tools.
> You may want to replace hdc to avoid an hdc failure interrupting any rebuild.
>
> I think what happened is that hdg failed prior to 2am and you didn't notice
> (mdadm --monitor is your friend). Then hdc had a real failure - at that point
> you had data loss (not enough good disks). I don't know why md rebuilt using hdc
> - I would expect it to have found hdc and hdg stale. If this is a newish kernel
> then maybe Neil should take a look...
>
> David
>
>