Mar 12 21:44:33 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 21:44:33 headache kernel: sda: Current: sense key: Hardware Error
Mar 12 21:44:33 headache kernel: Additional sense: Defect list error
Mar 12 21:44:33 headache kernel: end_request: I/O error, dev sda, sector 143363856
Mar 12 21:44:33 headache kernel: md: super_written gets error=-5, uptodate=0
Mar 12 21:44:33 headache kernel: raid1: Disk failure on sda3, disabling device.
Mar 12 21:44:33 headache kernel: Operation continuing on 1 devices
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel: --- wd:1 rd:2
Mar 12 21:44:33 headache kernel: disk 0, wo:1, o:0, dev:sda3
Mar 12 21:44:33 headache kernel: disk 1, wo:0, o:1, dev:sdb3
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel: --- wd:1 rd:2
Mar 12 21:44:33 headache kernel: disk 1, wo:0, o:1, dev:sdb3

I have two SCSI drives off an Adaptec AIC-7902B U320 (rev 10) controller. But smartctl gives this drive a clean bill of health:

[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

I have three RAID-1 partitions on these disks. The one that reported an error was the largest one. I dropped the degraded partition, and hot-added it back. Immediately, another error was logged to /var/log/messages, for the same block, but despite the error, the kernel started resyncing the array:
Mar 12 22:37:33 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 22:37:41 headache kernel: sda: Current: sense key: Medium Error
Mar 12 22:37:41 headache kernel: Additional sense: Unrecovered read error
Mar 12 22:37:41 headache kernel: Info fld=0x88b8f16
Mar 12 22:37:41 headache kernel: end_request: I/O error, dev sda, sector 143363862
Mar 12 22:37:41 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: md: bind<sda3>
Mar 12 22:37:42 headache kernel: RAID1 conf printout:
Mar 12 22:37:42 headache kernel: --- wd:1 rd:2
Mar 12 22:37:42 headache kernel: disk 0, wo:1, o:1, dev:sda3
Mar 12 22:37:42 headache kernel: disk 1, wo:0, o:1, dev:sdb3

Despite the second error, the resync of the failed partition completed successfully.
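
As a sanity check on the second error, the hexadecimal "Info fld" value can be converted to decimal and compared with the sector that end_request reports. This is only a minimal sketch using plain shell arithmetic; the variable names are my own, and the two values are copied from the log lines above.

```shell
# Values copied from the 22:37:41 log lines above.
info_fld=0x88b8f16
failed_sector=143363862

# printf converts the hex "Info fld" value to decimal.
info_dec=$(printf '%d' "$info_fld")
echo "Info fld = $info_dec"

# Compare the decoded value with the sector named by end_request.
if [ "$info_dec" -eq "$failed_sector" ]; then
    echo "same sector"
else
    echo "different sector"
fi
```

The two values agree, so the drive's sense data and the kernel's end_request message appear to be pointing at the same sector on sda.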
smartctl -a shows 80000+ read errors corrected by ECC/fast, no rereads, and 6 rewrites. My knowledge of SMART is limited. The other drive in this array shows 50000+ read errors corrected by ECC/fast, no rereads and no rewrites.
So, are the 6 rewrites on this drive an indication of a looming failure?

My second question concerns hot-swapping. The two drives are in a hot-swappable bay, connected to the Adaptec AIC-7902B U320 controller. Hardware-wise, the drives are hot-swappable, but what about software-wise? If I take this drive entirely off RAID-1, cut the power to the hot-swap bay, pull the drive out, replace it, plug it back in, and reenable power, will the FC6 kernel be able to deal with this?
If I cannot do this, my third question is what do I need to do, grub-wise, to be able to swap sdb with sda? sda is the one that's failing the RAID-1 array. If I can't hot-swap it, I'll need to replace it with the sdb drive, but right now grub is installed only on sda, so how do I install a copy of all the grub boot-related stuff on sdb?