RAID drive failed, but SMART shows no errors?

Sam Varshavchik <mrsam@xxxxxxxxxxxxxxx> · Mon, 12 Mar 2007 23:18:59 -0400

One of my FC6 machines just claimed that one of two RAID-1 SCSI drives had 
an error:

Mar 12 21:44:33 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 21:44:33 headache kernel: sda: Current: sense key: Hardware Error
Mar 12 21:44:33 headache kernel:     Additional sense: Defect list error
Mar 12 21:44:33 headache kernel: end_request: I/O error, dev sda, sector 143363856
Mar 12 21:44:33 headache kernel: md: super_written gets error=-5, uptodate=0
Mar 12 21:44:33 headache kernel: raid1: Disk failure on sda3, disabling device.
Mar 12 21:44:33 headache kernel:        Operation continuing on 1 devices
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 0, wo:1, o:0, dev:sda3
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
Mar 12 21:44:33 headache kernel: RAID1 conf printout:
Mar 12 21:44:33 headache kernel:  --- wd:1 rd:2
Mar 12 21:44:33 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
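
The return code on the first line of that log can be split into its individual bytes. This is only a minimal sketch, and it assumes the usual Linux SCSI layout of the result word (driver byte, host byte, message byte, status byte, from high to low), which the log itself does not state:

```shell
# Split the SCSI result word from the log into its four bytes.
# ASSUMED layout (not stated in the log):
#   driver << 24 | host << 16 | msg << 8 | status
result=$((0x08000002))

driver_byte=$(( (result >> 24) & 0xff ))
host_byte=$((   (result >> 16) & 0xff ))
msg_byte=$((    (result >> 8)  & 0xff ))
status_byte=$((  result        & 0xff ))

# Prints: driver=0x08 host=0x00 msg=0x00 status=0x02
printf 'driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x\n' \
    "$driver_byte" "$host_byte" "$msg_byte" "$status_byte"
```

If that layout assumption holds, 0x08 in the driver byte would mean that sense data is attached and 0x02 would be the CHECK CONDITION status, so the return code by itself only says "look at the sense data", and the sense key lines carry the actual diagnosis.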

I have two SCSI drives off an Adaptec AIC-7902B U320 (rev 10) controller.

But smartctl gives this drive a clean bill of health:

[root@headache ~]# smartctl -H /dev/sda
smartctl version 5.36 [i386-redhat-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK

I have three RAID-1 partitions on these disks.  The one that reported an 
error was the largest one.  I dropped the degraded partition, and hot-added 
it back.  Immediately, another error was logged to /var/log/messages, for 
the same block, but despite the error, the kernel started resyncing the 
array:

Mar 12 22:37:33 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Mar 12 22:37:41 headache kernel: sda: Current: sense key: Medium Error
Mar 12 22:37:41 headache kernel:     Additional sense: Unrecovered read error
Mar 12 22:37:41 headache kernel: Info fld=0x88b8f16
Mar 12 22:37:41 headache kernel: end_request: I/O error, dev sda, sector 143363862
Mar 12 22:37:41 headache kernel: Buffer I/O error on device sda3, logical block 35262625
Mar 12 22:37:41 headache kernel: md: bind<sda3>
Mar 12 22:37:42 headache kernel: RAID1 conf printout:
Mar 12 22:37:42 headache kernel:  --- wd:1 rd:2
Mar 12 22:37:42 headache kernel:  disk 0, wo:1, o:1, dev:sda3
Mar 12 22:37:42 headache kernel:  disk 1, wo:0, o:1, dev:sdb3
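
As a quick sanity check on the claim that both errors hit the same spot, the hexadecimal Info fld value from the second error can be converted to decimal and compared with the sector numbers in the end_request lines. A minimal sketch using plain shell arithmetic (the variable names are only illustrative):

```shell
# Values copied from the two kernel log excerpts.
info_fld=$((0x88b8f16))    # "Info fld" from the second error, in hex
first_error=143363856      # end_request sector, first error
second_error=143363862     # end_request sector, second error

# Prints: Info fld in decimal: 143363862
echo "Info fld in decimal: $info_fld"

# Prints: Sectors between the two errors: 6
echo "Sectors between the two errors: $((second_error - first_error))"
```

The Info fld matches the sector reported in the second end_request line, and the two failing sectors are only 6 sectors apart, which is consistent with both errors falling in the same small region of the disk.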

Despite the second error, the resync of the failed partition completed 
successfully.
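
For reference, the drop and hot-add described above can be sketched with mdadm roughly as follows. The array name /dev/md2 is hypothetical, since the md device is not named anywhere in the logs, and the run helper is only an illustrative dry-run wrapper that prints the commands instead of executing them:

```shell
# HYPOTHETICAL array name; the post does not say which md device holds sda3.
MD=/dev/md2
PART=/dev/sda3
DRY_RUN=1    # 1 = only print the commands; set to 0 (as root) to execute them

# Illustrative helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# The kernel had already marked sda3 as failed, so the sketch only
# removes the member from the array and then adds it back.
readd_member() {
    run mdadm "$MD" --remove "$PART"
    run mdadm "$MD" --add "$PART"
}

readd_member
```

After the add, the resync progress shows up in /proc/mdstat.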

smartctl -a shows 80000+ read errors corrected by ECC/fast, no rereads, 
and 6 rewrites. My knowledge of SMART is limited.  The other drive in this 
array shows 50000+ read errors corrected by ECC/fast, no rereads and no 
rewrites.

So, are the 6 rewrites on this drive an indication of a looming failure? 
My second question: the two drives are in a hot-swappable bay, connected 
to the Adaptec AIC-7902B U320 controller.  Hardware-wise, the drives are 
hot-swappable, but what about software-wise?  If I take this drive entirely 
off RAID-1, cut the power to the hot-swap bay, pull the drive out, replace 
it, plug it back in, and reenable power, will the FC6 kernel be able to deal 
with this?

If I cannot do this, my third question is what do I need to do, grub-wise, 
to be able to swap sdb with sda?  sda is the one that's failing the RAID-1 
array.  If I can't hot-swap it, I'll need to replace it with the sdb drive, 
but right now grub is installed only on sda, so how do I install a copy of 
all the grub boot-related stuff on sdb?
