So, in my personal experience with pending sectors, it's worth mentioning the following: if you run a "check" and you have pending sectors that fall within the partition used by the md device, they should be read and rewritten as needed, causing the count to go down.

However, I've noticed that sometimes pending sector counts don't go away after a "check". They would go away if I failed and then removed the drive with mdadm, and subsequently zero-filled the /entire/ drive (as opposed to just the partition on that disk that is used by the array). The reason is that there's a small chunk of unused space right after the partition that never gets read or written, even though I technically partition the entire drive as one large partition (type fd, Linux raid autodetect). I think what actually happens is that when the system reads data near the end of the array, the drive itself does read-ahead and caches it. So, even though the computer never requested those abandoned sectors, the drive eventually notices that it can't read them and makes a note of the fact. In other words, this is harmless. You could probably avoid these false positives on pending sectors by using the entire disk for the array (no partitions), but I'm pretty sure that breaks the raid auto-detection. A rough sketch of the fail/zero-fill/re-add procedure follows below.

Currently, my main array is a RAID 6 of eight 2TB Hitachi disks. It is scrubbed once a week (see the sample cron entry below), and one disk consistently shows 8 pending sectors. I'm certain I could make those go away if I wanted to, but frankly, it's purely cosmetic as far as I'm concerned. Some of my drives also have non-zero values for attribute 196 (Reallocated_Event_Count) and attribute 5 (Reallocated_Sector_Ct), but I have no drives with a non-zero 198 (Offline_Uncorrectable). I haven't had any problems with the disks or the array (other than a temperature-induced failure, but that's another story, and I still run the same disks after that event). I used to have lots of issues before I started scrubbing consistently.
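Since I keep recommending consistent scrubbing, here is a minimal sketch of how it can be scheduled, assuming a single array named md0 (the cron file name is made up; distros like Debian/Ubuntu already ship a checkarray cron job that does much the same):

    # /etc/cron.d/md-scrub (hypothetical file name)
    # Every Sunday at 03:00, read-check the whole array;
    # progress shows up in /proc/mdstat.
    0 3 * * 0  root  echo check > /sys/block/md0/md/sync_action

Note that "check" reads everything and rewrites blocks it cannot read (using redundancy), while "repair" additionally rewrites parity mismatches.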
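And the fail/zero-fill/re-add procedure I described above, as a rough sketch (assuming the member is /dev/sdc1 in /dev/md0; the device names are placeholders, so triple-check them, since the dd step is destructive):

    mdadm /dev/md0 --fail /dev/sdc1     # mark the member faulty
    mdadm /dev/md0 --remove /dev/sdc1   # detach it from the array
    dd if=/dev/zero of=/dev/sdc bs=1M   # zero the WHOLE disk, including the space past the partition
    # repartition as one full-disk type-fd partition (e.g. with fdisk), then:
    mdadm /dev/md0 --add /dev/sdc1      # re-add; md resyncs onto it

The resync after the --add is a full member rebuild, so on a 2TB disk expect it to take a while.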
Peter

----- Original Message -----
From: "Mathias Burén" <mathias.buren@xxxxxxxxx>
To: "Alex" <mysqlstudent@xxxxxxxxx>
Cc: "Mikael Abrahamsson" <swmike@xxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx
Sent: Friday, November 4, 2011 10:43:07 AM
Subject: Re: Impending failure?

On 4 November 2011 15:31, Alex <mysqlstudent@xxxxxxxxx> wrote:
> Hi,
>
>>> Can you point me to instructions on the best way to replace a disk?
>>
>> First run "repair" on the array, hopefully it'll notice the unreadable
>> blocks and re-write them.
>>
>> echo repair > /sys/block/md0/md/sync_action
>>
>> Also make sure your OS does regular scrubs of the raid; usually this is
>> done by monthly runs of checkarray. This is an example from Ubuntu:
>
> Great, thanks. I recalled something like that, but couldn't remember exactly.
>
> The system passed the above rebuild test on both arrays, but I'm
> obviously still concerned about the disk. Here are the relevant
> smartctl lines:
>
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 108   089   006    Pre-fail Always  -           0
>   3 Spin_Up_Time            0x0003 094   094   000    Pre-fail Always  -           0
>   4 Start_Stop_Count        0x0032 100   100   020    Old_age  Always  -           29
>   5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
>   7 Seek_Error_Rate         0x000f 083   060   030    Pre-fail Always  -           209739855
>   9 Power_On_Hours          0x0032 074   074   000    Old_age  Always  -           22816
>  10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
>  12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           37
> 187 Reported_Uncorrect      0x0032 095   095   000    Old_age  Always  -           5
> 189 High_Fly_Writes         0x003a 100   100   000    Old_age  Always  -           0
> 190 Airflow_Temperature_Cel 0x0022 075   064   045    Old_age  Always  -           25 (Min/Max 23/32)
> 194 Temperature_Celsius     0x0022 025   040   000    Old_age  Always  -           25 (0 18 0 0)
> 195 Hardware_ECC_Recovered  0x001a 057   045   000    Old_age  Always  -           51009302
> 197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           2
> 198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           2
> 199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
> 200 Multi_Zone_Error_Rate   0x0000 100   253   000    Old_age  Offline -           0
> 202 Data_Address_Mark_Errs  0x0032 100   253   000    Old_age  Always  -           0
>
> Pending_sector and uncorrectable are both greater than zero. Is this
> drive on its way to failure?
>
> Can someone point me to the proper mdadm commands to set the drive
> faulty then rebuild it after installing the new one?
>
> Thanks again,
> Alex

187 Reported_Uncorrect      0x0032 095 095 000 Old_age Always  - 5
197 Current_Pending_Sector  0x0012 100 100 000 Old_age Always  - 2
198 Offline_Uncorrectable   0x0010 100 100 000 Old_age Offline - 2

This tells me to get rid of the drive. I don't know the mdadm commands off the top of my head, sorry, but they're in the man page(s). If you want, run a scrub and see if these numbers change. If the drive fails hard enough, md will kick it out of the array anyway. Btw, I scrub my RAID6 (7 HDDs) once a week.

/Mathias
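For reference, the replacement sequence Alex asked about looks roughly like the following. This is a hedged sketch rather than anything from the man pages verbatim; /dev/md0 and /dev/sdb1 are assumed names, and the new disk has to be partitioned like the old one before the final step:

    mdadm /dev/md0 --fail /dev/sdb1     # mark the suspect member faulty
    mdadm /dev/md0 --remove /dev/sdb1   # remove it so the disk can be swapped
    # swap in the new disk and create the same full-disk type-fd partition
    mdadm /dev/md0 --add /dev/sdb1     # add the new member; md rebuilds onto it

Rebuild progress appears in /proc/mdstat, and the scrub-then-recheck Mathias suggests amounts to echoing "check" into /sys/block/md0/md/sync_action and re-reading the counters afterwards with smartctl -A /dev/sdb.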