Re: "raid5:md0: read error not correctable (sector 795463080 on sdf1)" error on controller with SIL 3114

Håkon Løvdal <hlovdal@xxxxxxxxx> · Wed, 17 Feb 2010 03:42:45 +0100

On 9 February 2010 06:13, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
> On 02/08/2010 05:11 AM, Håkon Løvdal wrote:
>> ---BEGIN log-4---
>> Feb  6 07:09:57 localhost kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> Feb  6 07:09:57 localhost kernel: ata8.00: BMDMA2 stat 0x6c0009
>> Feb  6 07:09:57 localhost kernel: ata8.00: cmd 25/00:80:cf:cd:69/00:00:2f:00:00/e0 tag 0 dma 65536 in
>> Feb  6 07:09:57 localhost kernel:         res 51/40:00:e4:cd:69/00:00:2f:00:00/e0 Emask 0x9 (media error)
>> Feb  6 07:09:57 localhost kernel: ata8.00: status: { DRDY ERR }
>> Feb  6 07:09:57 localhost kernel: ata8.00: error: { UNC }
>
> That's fairly definitive, uncorrected read error reported by the drive. You
> might want to check its SMART status. Could be a bad drive, or potentially
> other causes like excessive vibration, high temperature, power issues..
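
For reference, the cmd and res lines in the quoted log can be decoded by
hand. Below is a minimal Python sketch, assuming the usual libata layout
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
(in a res line the first two fields are the status and error registers).
The function name decode_taskfile is made up for illustration.

```python
def decode_taskfile(dump):
    """Decode a libata taskfile dump such as '25/00:80:cf:cd:69/00:00:2f:00:00/e0'."""
    # Four groups separated by '/': command, low registers, high-order
    # (hob) registers, device. This layout is an assumption about libata.
    head, low, high, device = dump.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit LBA: the three hob bytes are the upper half, the others the lower.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command_or_status": int(head, 16),
        "feature_or_error": feature,
        "count": hob_nsect << 8 | nsect,
        "lba": lba,
        "device": int(device, 16),
    }


cmd = decode_taskfile("25/00:80:cf:cd:69/00:00:2f:00:00/e0")
res = decode_taskfile("51/40:00:e4:cd:69/00:00:2f:00:00/e0")
print(cmd["lba"], cmd["count"], res["lba"])
```

For the log above this gives a 128-sector read (128 * 512 = 65536 bytes,
matching "dma 65536 in") starting at LBA 795463119, with the error reported
at LBA 795463140.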

All of sdb, sdc, sdd, sde, sdf and sdg have kept a normalized value of 100
for the whole lifetime of the disk (I have a cron job that captures
smartctl output nightly for reference, and I have now checked those files)
for all the critical attributes listed at
http://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes
  1 Raw_Read_Error_Rate
  5 Reallocated_Sector_Ct
 10 Spin_Retry_Count
184 Unknown_Attribute
188 Unknown_Attribute
196 Reallocated_Event_Count
197 Current_Pending_Sector
198 Offline_Uncorrectable
201 Soft_Read_Error_Rate
except for Soft_Read_Error_Rate, which switches between 100 and 253.
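
Going through the nightly captures could be automated along these lines.
This is only a rough Python sketch, assuming the files hold the attribute
table printed by smartctl -A with the normalized value in the fourth
column. The names CRITICAL_IDS, normalized_values and deviations are my
own, and the SAMPLE text is an invented example, not output from any of my
disks.

```python
# Attribute IDs from the list above.
CRITICAL_IDS = {1, 5, 10, 184, 188, 196, 197, 198, 201}

# Invented sample in the layout of the smartctl -A attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0
"""


def normalized_values(report):
    """Map attribute ID -> normalized VALUE for every attribute row."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[int(fields[0])] = int(fields[3])
    return values


def deviations(report, expected=100):
    """Return the critical attributes whose normalized value differs from expected."""
    return {
        attr_id: value
        for attr_id, value in normalized_values(report).items()
        if attr_id in CRITICAL_IDS and value != expected
    }


print(deviations(SAMPLE))
```

Run over each saved file, this would list only the attributes that ever
left 100, which in my case should be Soft_Read_Error_Rate alone.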

The disks are now placed in an Image Shapetek EYE-981SC tower[1] with plenty
of space, mounted in 5.25" bays with rubber hard disk stabilizers[2] to
reduce vibration. There is therefore good airflow around all the disks, and
I keep one side of the tower case open, so temperature should not be a
problem (any longer).

In the previous case space was tighter. I see that last summer hde and hdf
had temperatures of around 45-55°C in June/July, which does not sound too
good[3]. They are still part of the raid, whereas hdc, which has an
excellent temperature profile of 35-45°C, and hdd (28-38°C) are the two
disks currently being kicked out of the raid.

There might be some issues with the PSU[4] (I am waiting for a new one). I
doubt there is any problem with the mains electricity, because the quality
is generally quite good here in Norway, and besides, the machine is behind
a UPS.

smartctl -l selftest /dev/sde gives
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%       795         1465145815
    # 2  Conveyance offline  Completed: read failure       90%       794         1465145815
    # 3  Offline             Completed: read failure       00%       790         1465145815
    # 4  Short offline       Completed: read failure       20%       787         1465145815
None of the other disks report any selftest failures.
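
The self-test log can be checked mechanically as well. Below is a rough
Python sketch, assuming the column layout shown above with each entry on a
single line. The regular expression and the names ROW and parse_selftest
are my own and not part of smartmontools.

```python
import re

# The sde self-test entries from above, one entry per line.
LOG = """\
# 1  Extended offline    Completed: read failure       90%       795         1465145815
# 2  Conveyance offline  Completed: read failure       90%       794         1465145815
# 3  Offline             Completed: read failure       00%       790         1465145815
# 4  Short offline       Completed: read failure       20%       787         1465145815
"""

# Columns: number, description, status, remaining %, lifetime hours, first error LBA.
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest(text):
    """Parse self-test log rows into a list of dicts; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            num, test, status, remaining, hours, lba = match.groups()
            rows.append({
                "num": int(num),
                "test": test,
                "status": status,
                "remaining": int(remaining),
                "hours": int(hours),
                "lba": lba,  # kept as text, since smartctl prints '-' when there is no error
            })
    return rows


failed = [row for row in parse_selftest(LOG) if "failure" in row["status"]]
print(sorted({row["lba"] for row in failed}))
```

All four failed tests stop at the same LBA, 1465145815, so this looks like
one bad spot on sde rather than errors scattered over the disk.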

So sde and sdf show some signs of trouble (temperature, selftest and the
ata8.00 exception above), but they are not kicked out of the raid. On the
other hand, sdc and sdd are both kicked out, and I cannot see any obvious
signs of hardware trouble there. Any suggestions?

BR Håkon Løvdal

[1]
http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.hardware.no%2Fartikler%2Fi_s_981_servertower%2F46558%2Futskrift&sl=no&tl=en

[2]
http://www.scythe-eu.com/en/products/pc-accessory/hard-disk-stabilizer-2.html

[3]
http://en.wikibooks.org/wiki/Minimizing_hard_disk_drive_failure_and_data_loss#Temperature_control

[4]
350W, Point of view, VP-3504