raid5 - which disk failed?

Rainer Fuegenstein <rfu@xxxxxxxxxxxxxxxxxxxxxxxx> · Mon, 24 Sep 2007 01:17:10 +0200

Hi,

I'm running a RAID 5 array of four 400 GB PATA disks on a rather old VIA
mainboard under CentOS 5.0. A few days ago the server started to
reboot or freeze occasionally. After each reboot, md starts a resync
of the array:
$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>
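
For reference, the member-to-status mapping in that output can be read off mechanically. The following is only a rough sketch: the function name member_status is made up for illustration, and it assumes the number in brackets after each device (e.g. hde1[0]) is also that device's position in the [UUUU] string, which should hold for an array whose members have never been replaced but is not guaranteed otherwise.

```python
import re

# Sample copied from the mdstat output above.
MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>
"""


def member_status(mdstat_text, array="md0"):
    """Return {member device: 'U' or '_'} for one array (hypothetical helper)."""
    lines = mdstat_text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith(array + " :"):
            continue
        # Members look like "hde1[0]"; assumption: the bracketed number
        # is the position of that member in the status string.
        slots = {int(n): dev for dev, n in re.findall(r"(\w+)\[(\d+)\]", line)}
        # The status string such as [UUUU] or [_UUU] is on the next line;
        # "[4/4]" does not match because it contains other characters.
        flags = re.search(r"\[([U_]+)\]", lines[i + 1]).group(1)
        return {slots[k]: flags[k] for k in sorted(slots) if k < len(flags)}
    raise ValueError(f"array {array!r} not found")
```

Called on the text above, member_status(MDSTAT) reports "U" for all four members, so md itself does not consider any of them failed at this point.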

After about an hour, the server freezes again. Around the same time,
the following errors show up in the messages log:

Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 23 22:28:40 alfred kernel:     ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio
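
To summarize the log above, here is a small sketch that counts the end_request I/O errors per device and checks how the LBAsect value relates to the high and low fields. The function names summarize and lba_consistent are made up for illustration, the sample is only an excerpt of the lines above, and the 24-bit split of high and low is an assumption that happens to be consistent with the numbers in this log.

```python
import re
from collections import Counter

# Excerpt of the log lines above: every end_request line plus one dma_intr line.
LOG = """\
Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
"""

IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
LBA = re.compile(r"LBAsect=(\d+), high=(\d+), low=(\d+)")


def summarize(log_text):
    """Return (error count per device, failing sectors per device)."""
    errors = Counter()
    sectors = {}
    for line in log_text.splitlines():
        m = IO_ERROR.search(line)
        if m:
            dev, sector = m.group(1), int(m.group(2))
            errors[dev] += 1
            sectors.setdefault(dev, []).append(sector)
    return errors, sectors


def lba_consistent(line):
    """True if LBAsect == high * 2**24 + low (assumed 24-bit split), None if absent."""
    m = LBA.search(line)
    if not m:
        return None
    lba, high, low = map(int, m.groups())
    return (high << 24) | low == lba
```

On this excerpt every I/O error is attributed to hde, and the failing sectors are spaced 8 apart (254106007, 254106015, ...), i.e. consecutive 4 KiB requests assuming 512-byte sectors. For the dma_intr line, 15 * 2^24 + 2447775 = 254106015, which matches LBAsect.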

Now there are two things that puzzle me:

1) When md starts a resync of the array, shouldn't one drive be marked
as down ([_UUU]) in mdstat instead of the array being reported as
[UUUU]? Or, the other way round: is hde really the faulty drive? How
can I make sure I remove and replace the right drive?

2) Can a faulty drive in a RAID 5 really crash the whole server? Maybe
the bug in the onboard Promise controller adds to the problem (see
the attachment for dmesg output).

tia.
Attachment: dmesg