On Tue, 9 Nov 2010 13:41:11 +0100 Sebastian Färber <faerber@xxxxxxxxx> wrote:

> Hi,
>
> i just stumbled across a problem while rebuilding a MD RAID1 on 2.6.32.25.
> The server has 2 disks, /dev/hda and /dev/sda. The RAID1 is degraded, so sda
> was replaced and i tried rebuilding from /dev/hda to /dev/sdb.
> While rebuilding i noticed that /dev/hda has some problems/bad sectors
> but the kernel seems to be stuck in some endless loop:
>
> --
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239147198, sector=239147057
> hda: possibly failed opcode: 0xc8
> end_request: I/O error, dev hda, sector 239147057
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148174, sector=239148081
> hda: possibly failed opcode: 0xc8
> end_request: I/O error, dev hda, sector 239148081
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239147213, sector=239147209
> hda: possibly failed opcode: 0xc8
> end_request: I/O error, dev hda, sector 239147209
> raid1: hda: unrecoverable I/O read error for block 237892224
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148174, sector=239148169
> hda: possibly failed opcode: 0xc8
> end_request: I/O error, dev hda, sector 239148169
> raid1: hda: unrecoverable I/O read error for block 237893120
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=239148225, sector=239148225
> hda: possibly failed opcode: 0xc8
> end_request: I/O error, dev hda, sector 239148225
> raid1: hda: unrecoverable I/O read error for block 237893248
> md: md1: recovery done.
> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 0, wo:0, o:1, dev:hda6
>  disk 1, wo:1, o:1, dev:sda6
> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 0, wo:0, o:1, dev:hda6
>  disk 1, wo:1, o:1, dev:sda6
> RAID1 conf printout:
>  --- wd:1 rd:2
>  disk 0, wo:0, o:1, dev:hda6
>  disk 1, wo:1, o:1, dev:sda6
> --
>
> I get a new "conf printout" message every few seconds until i used
> mdadm to set /dev/sda6 to "faulty". I know /dev/hda is bad and i probably
> won't be able to rebuild the raid device, but this endless loop seems fishy?

Fishy indeed!!

This was supposed to have been fixed by commit
   4044ba58dd15cb01797c4fd034f39ef4a75f7cc3
in 2.6.29, but it seems not.

The following patch should fix it properly. Are you able to apply this
patch to your kernel, rebuild, and see if it makes the required difference?

Thanks.

I'm working on making md cope with this situation better and actually finish
the recovery - recording where the bad blocks are so that when you read from
the new device you can still get read errors, but when you over-write, the
error goes away. But there are so many other things to do....

For now, your best bet might be to use dd-rescue (or is that ddrescue) to
copy from hda6 to sda6, then stop using hda6.

NeilBrown

From c074e12fe437827908bc31247a05aec4815e1a1b Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@xxxxxxx>
Date: Mon, 15 Nov 2010 12:32:47 +1100
Subject: [PATCH] md/raid1: really fix recovery looping when single good device fails.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit 4044ba58dd15cb01797c4fd034f39ef4a75f7cc3 supposedly fixed a
problem where if a raid1 with just one good device gets a read-error
during recovery, the recovery would abort and immediately restart in
an infinite loop.

However it depended on raid1_remove_disk removing the spare device
from the array. But that does not happen in this case.
So add a test so that in the 'recovery_disabled' case, the device will be
removed.

Reported-by: Sebastian Färber <faerber@xxxxxxxxx>
Signed-off-by: NeilBrown <neilb@xxxxxxx>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 45f8324..845cf95 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1161,6 +1161,7 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
 		 * is not possible.
 		 */
 		if (!test_bit(Faulty, &rdev->flags) &&
+		    !mddev->recovery_disabled &&
 		    mddev->degraded < conf->raid_disks) {
 			err = -EBUSY;
 			goto abort;
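
To make the effect of the one-line change easier to see, here is a rough
user-space sketch of the condition in raid1_remove_disk before and after the
patch. It is not kernel code: struct model_rdev, struct model_array and both
function names are hypothetical stand-ins, and only the fields the condition
reads are modelled. It assumes the reported array (wd:1 rd:2) has
degraded == 1 and that recovery_disabled has been set after the failed
recovery.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * that the patched condition looks at are modelled here. */
struct model_rdev {
	int faulty;             /* models test_bit(Faulty, &rdev->flags) */
};

struct model_array {
	int recovery_disabled;  /* models mddev->recovery_disabled */
	int degraded;           /* models mddev->degraded */
	int raid_disks;         /* models conf->raid_disks */
};

/* Condition before the patch: a non-faulty device is never removed
 * while the array is degraded but still has a working member. */
int remove_check_before(const struct model_rdev *rdev,
			const struct model_array *md)
{
	assert(md->raid_disks > 0);
	if (!rdev->faulty && md->degraded < md->raid_disks)
		return -EBUSY;
	return 0;
}

/* Condition after the patch: once recovery has been disabled, the
 * removal is no longer refused, so the spare can leave the array. */
int remove_check_after(const struct model_rdev *rdev,
		       const struct model_array *md)
{
	assert(md->raid_disks > 0);
	if (!rdev->faulty &&
	    !md->recovery_disabled &&
	    md->degraded < md->raid_disks)
		return -EBUSY;
	return 0;
}
```

With faulty = 0, recovery_disabled = 1, degraded = 1 and raid_disks = 2,
remove_check_before() returns -EBUSY while remove_check_after() returns 0.
In this model that is the difference between the spare staying in the array
(so recovery can be retried) and the spare being removed.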