Hi,

I'm using a RAID 5 with 4*400 GB PATA disks on a rather old VIA mainboard, running CentOS 5.0. A few days ago the server started to reboot or freeze occasionally. After each reboot, md always starts a resync of the array:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>

After about an hour, the server freezes again. I found that around this time the following errors are reported in the messages log:

Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 23 22:28:40 alfred kernel: ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

Now there are two things that puzzle me:

1) When md starts a resync of the array, shouldn't one drive be marked as down ([_UUU]) in mdstat instead of the array being reported as [UUUU]? Or, the other way round: is hde really the faulty drive? How can I make sure I'm removing and replacing the proper drive?

2) Can a faulty drive in a RAID 5 really crash the whole server? Maybe the bug in the onboard Promise controller adds to this problem (see attachment for dmesg output).

TIA.
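
P.S. In case it helps anyone cross-check the numbers above, here is a rough Python sketch. It rests on two assumptions of mine: that the logged fields are related by LBAsect = high * 2^24 + low, and that mdstat's "finish" value is simply the remaining blocks divided by the current speed. The function names are made up for illustration; the sample lines and figures are copied from the output above.

```python
import re

# Two of the dma_intr error lines, copied from the messages log above.
LOG = """\
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
"""

# Matches only the "error=" lines that carry the sector details.
ERROR_RE = re.compile(
    r"(hd[a-z]): dma_intr: error=(0x[0-9a-fA-F]+) \{ (\w+) \}, "
    r"LBAsect=(\d+), high=(\d+), low=(\d+), sector=(\d+)"
)


def parse_errors(text):
    """Return one dict per dma_intr error line found in text."""
    errors = []
    for m in ERROR_RE.finditer(text):
        lba, high, low, sector = (int(m.group(i)) for i in range(4, 8))
        errors.append({
            "device": m.group(1),
            "kind": m.group(3),
            "sector": sector,
            # Assumption: 'high' holds the upper bits, 'low' the lower 24 bits.
            "consistent": (high << 24) + low == lba,
        })
    return errors


def raid5_capacity_blocks(disks, blocks_per_disk):
    """Usable RAID 5 size: one disk's worth of space goes to parity."""
    return (disks - 1) * blocks_per_disk


def resync_eta_minutes(done, total, speed_k_per_sec):
    """Assumed ETA formula: remaining 1K blocks over speed in K/sec."""
    return (total - done) / speed_k_per_sec / 60


if __name__ == "__main__":
    for err in parse_errors(LOG):
        print(err)
    print(raid5_capacity_blocks(4, 390708736))
    print(round(resync_eta_minutes(3819132, 390708736, 17603), 1))
```

With the figures above, the capacity check reproduces the 1172126208 blocks mdstat shows for md0, and the estimate lands within a fraction of a minute of the reported finish=366.2min, so the mdstat output at least looks self-consistent. Every error line I checked points at hde, at sectors 8 apart.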
Attachment:
dmesg
Description: Binary data