Re: Fwd: Error on /dev/sda, but takes down RAID-1

Michael Tokarev <mjt@xxxxxxxxxx> · Wed, 23 Jan 2008 21:35:32 +0300

Martin Seebach wrote:
> Hi, 
> 
> I'm not sure this is completely linux-raid related, but I can't figure out where to start: 
> 
> A few days ago, my server died. I was able to log in and salvage this content of dmesg: 
> http://pastebin.com/m4af616df 
> 
> I talked to my hosting-people and they said it was an io-error on /dev/sda, and replaced that drive. 
> After this, I was able to boot into a PXE-image and re-build the two RAID-1 devices with no problems - indicating that sdb was fine. 
> 
> I expected RAID-1 to be able to stomach exactly this kind of error - one drive dying. What did I do wrong? 

Going through that pastebin page:

First, sdb has failed for whatever reason:

ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 80324865
raid1: Disk failure on sdb1, disabling device.
        Operation continuing on 1 devices
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
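
To read those printouts mechanically, here is a rough Python
sketch. It assumes (my reading of the raid1 printout, not
something stated in the log itself) that "wd" is the number
of working disks, "rd" the number of raid disks, "wo:0" means
the member is in sync, and "o:1" means it is operational.
parse_conf_printout is a made-up helper name.

```python
import re

# Patterns for the two kinds of lines in a "RAID1 conf printout" block.
CONF_HEADER = re.compile(r"--- wd:(\d+) rd:(\d+)")
CONF_DISK = re.compile(r"disk (\d+), wo:(\d), o:(\d), dev:(\S+)")

def parse_conf_printout(lines):
    """Return (working, total, disks) for one conf printout block."""
    working = total = None
    disks = []
    for line in lines:
        m = CONF_HEADER.search(line)
        if m:
            working, total = int(m.group(1)), int(m.group(2))
            continue
        m = CONF_DISK.search(line)
        if m:
            slot, wo, o, dev = m.groups()
            disks.append({
                "slot": int(slot),
                "in_sync": wo == "0",      # assumption: wo:0 = in sync
                "operational": o == "1",   # assumption: o:1 = not faulty
                "dev": dev,
            })
    return working, total, disks

# First printout from the log above.
printout = """\
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:1, o:0, dev:sdb1
""".splitlines()

working, total, disks = parse_conf_printout(printout)
degraded = working < total
healthy = [d["dev"] for d in disks if d["in_sync"] and d["operational"]]
print(f"{working}/{total} working, degraded={degraded}, healthy: {healthy}")
```

With the first printout this reports a degraded array whose
only healthy member is sda1, which matches the second
printout where sdb1 is gone.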

At this point md started to (re)sync the other(?) arrays
for some reason:

md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 40162432 blocks.
md: md0: sync done.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 100060736 blocks.

Note the errors on sdb again:

sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455000
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455256
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sdb, sector 112455512
...

raid1: Disk failure on sdb3, disabling device.
        Operation continuing on 1 devices
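
As an aside on "return code = 0x00040000": a small Python
sketch of how that value splits up, assuming the usual Linux
SCSI layout of the result word (status, message, host and
driver bytes, lowest byte first). The host-byte table is
quoted from memory of the kernel headers and is only a
subset; decode_scsi_result is a made-up helper name.

```python
# Host-byte codes as named in the Linux SCSI headers (subset, from memory).
HOST_BYTE = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}

def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,          # SCSI status byte
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": (result >> 16) & 0xFF,    # host adapter byte
        "driver": (result >> 24) & 0xFF,  # driver byte
    }

fields = decode_scsi_result(0x00040000)
host_name = HOST_BYTE.get(fields["host"], "unknown")
print(fields, host_name)
```

If that layout holds, only the host byte is set (0x04,
DID_BAD_TARGET), which would fit requests being rejected
because the device had already been disabled, rather than
the drive itself reporting a media error.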

So another md array detected the sdb failure as well.
That leaves us with sda only.  And voila, sda fails too,
some time later:

ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 80324865
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 115481
...

At this point, the arrays are hosed - all disks
of each array have failed, there's no data any
more to read/write from/to.
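
The order of events can be pulled out of such a log with a
few lines of Python. This is only a sketch: the regexes are
written against the message formats quoted above,
failure_timeline is a made-up helper name, and the sample
input is a trimmed copy of the excerpts, not the full dmesg.

```python
import re

# Message formats as they appear in the excerpts above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
MD_FAIL = re.compile(r"raid1: Disk failure on (\w+), disabling device")

def failure_timeline(log_lines):
    """Return [(line_no, event, device)], first event of each kind per device."""
    seen = set()
    timeline = []
    for no, line in enumerate(log_lines, start=1):
        for event, pattern in (("io_error", IO_ERROR), ("md_kicked", MD_FAIL)):
            m = pattern.search(line)
            if m and (event, m.group(1)) not in seen:
                seen.add((event, m.group(1)))
                timeline.append((no, event, m.group(1)))
    return timeline

# Trimmed copy of the log excerpts, in their original order.
log = """\
end_request: I/O error, dev sdb, sector 80324865
raid1: Disk failure on sdb1, disabling device.
end_request: I/O error, dev sdb, sector 112455000
raid1: Disk failure on sdb3, disabling device.
end_request: I/O error, dev sda, sector 80324865
end_request: I/O error, dev sda, sector 115481
""".splitlines()

timeline = failure_timeline(log)
for entry in timeline:
    print(entry)

# Whole disks that reported at least one I/O error.
failed_disks = sorted({dev for _, event, dev in timeline if event == "io_error"})
print(failed_disks)
```

On this input both sda and sdb end up in failed_disks, with
sdb first and sda last, which is the sequence described
above.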

Since sda was later replaced, and sdb recovered
from the errors (it still contains valid superblocks,
though with somewhat stale information), everything
went ok.

But the original problem is that BOTH of your disks
failed, not only one.  What caused THIS problem is
another question.  Maybe overheating, or a power
supply problem, or some such, I don't know...  But
the md code did the best it could here.

/mjt