On 9 February 2010 06:13, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
> On 02/08/2010 05:11 AM, Håkon Løvdal wrote:
>> ---BEGIN log-4---
>> Feb 6 07:09:57 localhost kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>> Feb 6 07:09:57 localhost kernel: ata8.00: BMDMA2 stat 0x6c0009
>> Feb 6 07:09:57 localhost kernel: ata8.00: cmd 25/00:80:cf:cd:69/00:00:2f:00:00/e0 tag 0 dma 65536 in
>> Feb 6 07:09:57 localhost kernel:          res 51/40:00:e4:cd:69/00:00:2f:00:00/e0 Emask 0x9 (media error)
>> Feb 6 07:09:57 localhost kernel: ata8.00: status: { DRDY ERR }
>> Feb 6 07:09:57 localhost kernel: ata8.00: error: { UNC }
>
> That's fairly definitive, uncorrected read error reported by the drive. You
> might want to check its SMART status. Could be a bad drive, or potentially
> other causes like excessive vibration, high temperature, power issues..

All of sdb, sdc, sdd, sde, sdf and sdg have had a normalized value of
100 for the whole lifetime of the disk (I have a cron job that captures
the output from smartctl nightly for reference, and I have now checked
those files) for all the critical attributes listed at
http://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes

    1 Raw_Read_Error_Rate
    5 Reallocated_Sector_Ct
   10 Spin_Retry_Count
  184 Unknown_Attribute
  188 Unknown_Attribute
  196 Reallocated_Event_Count
  197 Current_Pending_Sector
  198 Offline_Uncorrectable
  201 Soft_Read_Error_Rate

except for Soft_Read_Error_Rate, which switches between 100 and 253.

The disks are now placed in an Image Shapetek EYE-981SC tower[1] with
plenty of space, and they are mounted in 5.25" bays with rubber hard
disk stabilizers[2] to reduce vibration. There is therefore good
airflow around all the disks, and I keep one side of the tower case
open, so temperature should not be a problem (any longer).

In the previous case, space was tighter. I see that last summer hde and
hdf had temperatures of around 45-55°C in June/July, which does not
sound too good[3].
They are still part of the raid, whereas hdc, which has an excellent
temperature profile of 35-45°C, and hdd (28-38°C) are the two disks
currently being kicked out of the raid.

There might be some issues with the PSU[4] (I am waiting for a new
one). I doubt there are any problems with the line electricity, because
the quality is generally quite good here in Norway, and besides, the
machine is behind a UPS.

smartctl -l selftest /dev/sde gives

Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure     90%          795         1465145815
# 2  Conveyance offline  Completed: read failure     90%          794         1465145815
# 3  Offline             Completed: read failure     00%          790         1465145815
# 4  Short offline       Completed: read failure     20%          787         1465145815

None of the other disks report any selftest failures.

So sde and sdf show some signs of trouble (temperature, selftest and
the ata8.00 exception above), but they are not kicked out of the raid.
On the other hand, sdc and sdd are both kicked out, and I cannot see any
obvious signs of hardware trouble there.

Any suggestions?

BR Håkon Løvdal

[1] http://translate.google.com/translate?js=y&prev=_t&hl=en&ie=UTF-8&layout=1&eotf=1&u=http%3A%2F%2Fwww.hardware.no%2Fartikler%2Fi_s_981_servertower%2F46558%2Futskrift&sl=no&tl=en
[2] http://www.scythe-eu.com/en/products/pc-accessory/hard-disk-stabilizer-2.html
[3] http://en.wikibooks.org/wiki/Minimizing_hard_disk_drive_failure_and_data_loss#Temperature_control
[4] 350W, Point of view, VP-3504
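
The check against the nightly smartctl captures described above
(critical attributes staying at a normalized value of 100) could be
scripted roughly as follows. This is only a minimal Python sketch, not
the actual cron job: the function name below_baseline and the baseline
parameter are made up for illustration, and it assumes the saved files
contain the standard attribute table printed by "smartctl -A". The
baseline of 100 is taken from the values reported in this message;
other drive models may start from a different normalized value.

```python
import re

# Attribute IDs named in the message as the critical ones.
CRITICAL_IDS = {1, 5, 10, 184, 188, 196, 197, 198, 201}

# One row of the `smartctl -A` attribute table, whose columns are:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s"
)


def below_baseline(smartctl_output, baseline=100):
    """Return (id, name, value, worst) for every critical attribute
    whose normalized VALUE is below `baseline` (an assumed default)."""
    hits = []
    for line in smartctl_output.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, blank line, or other non-table text
        attr_id = int(match.group(1))
        name = match.group(2)
        value = int(match.group(3))
        worst = int(match.group(4))
        if attr_id in CRITICAL_IDS and value < baseline:
            hits.append((attr_id, name, value, worst))
    return hits
```

Calling below_baseline() on the text of each saved capture returns an
empty list for a disk whose critical attributes never dropped below
100; a Soft_Read_Error_Rate reading of 253 is not flagged because it is
above the baseline.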