Re: Impending failure?

Based on my personal experience with pending sectors, it's worth mentioning the following:

If you do a "check" and you have any pending sectors within the partition used by the md device, they should be read and rewritten as needed, causing the count to go down. However, I've noticed that sometimes pending sector counts on a drive don't go away after a "check". These would go away if I failed and removed the drive with mdadm and then zero-filled the /entire/ drive (as opposed to just the partition on that disk that is used by the array). The reason is that there's a small chunk of unused space right after the partition that never gets read or written, even though I technically partition the entire drive as one large partition (type fd, Linux raid autodetect).
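
Roughly, what I do looks like this (a sketch only; /dev/md0 and /dev/sdX are example names, adjust for your own setup):

  # fail and remove the member from the array
  mdadm /dev/md0 --fail /dev/sdX1
  mdadm /dev/md0 --remove /dev/sdX1

  # zero the *entire* drive, not just the partition
  dd if=/dev/zero of=/dev/sdX bs=1M

  # check that the pending count dropped
  smartctl -A /dev/sdX | grep -i pending

  # re-create the single type-fd partition, then re-add and let it resync
  mdadm /dev/md0 --add /dev/sdX1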

I think what actually happens is that when the system reads data from near the end of the array, the drive itself does read-ahead and caches it. So even though the computer never requested those unused sectors, the drive eventually notices that it can't read them and makes a note of the fact. In other words, this is harmless.
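
If you'd rather not fail the drive out of the array, you could in principle zero just that unused tail instead (again only a sketch; /dev/sdX and the sector number are examples, so double-check the end sector of your last partition before writing anywhere):

  # end sector of the last partition, from e.g. `parted /dev/sdX unit s print`
  PART_END=3907024064   # example value, NOT yours

  # write zeros from just past the partition to the end of the disk;
  # dd stops with a "no space left on device" error at the end, which is expected
  dd if=/dev/zero of=/dev/sdX bs=512 seek=$((PART_END + 1))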

You could probably avoid these false positives on pending sectors by using the entire disk for the array (no partitions), but I'm pretty sure that breaks the in-kernel RAID auto-detection.
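
If you did want whole-disk members, the usual workaround is to assemble from mdadm.conf in userspace instead of relying on in-kernel autodetect; a minimal sketch, where the UUID is just a placeholder for your array's own:

  # /etc/mdadm.conf
  DEVICE /dev/sd[a-h]
  ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx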

Currently, my main array has eight 2TB Hitachi disks in a RAID 6. It is scrubbed once a week, and one disk consistently has 8 pending sectors on it. I'm certain I could make those go away if I wanted to, but frankly, it's purely aesthetic as far as I'm concerned. Some of my drives also have a non-zero "196 Reallocated_Event_Count" and "5 Reallocated_Sector_Ct"; however, I have no drives with a non-zero "198 Offline_Uncorrectable". I haven't had any problems with the disks or the array (other than a temperature-induced failure ... but that's another story, and I still run the same disks after that event). I used to have lots of issues before I started scrubbing consistently.
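
For reference, the weekly scrub is just a cron job; mine amounts to something like this (the path and schedule are examples only, and Debian/Ubuntu already ship a checkarray cron job that serves the same purpose):

  # /etc/cron.d/md-scrub (example): check md0 every Sunday at 04:00
  0 4 * * 0  root  echo check > /sys/block/md0/md/sync_action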

Peter

----- Original Message -----
From: "Mathias Burén" <mathias.buren@xxxxxxxxx>
To: "Alex" <mysqlstudent@xxxxxxxxx>
Cc: "Mikael Abrahamsson" <swmike@xxxxxxxxx>, linux-raid@xxxxxxxxxxxxxxx
Sent: Friday, November 4, 2011 10:43:07 AM
Subject: Re: Impending failure?

On 4 November 2011 15:31, Alex <mysqlstudent@xxxxxxxxx> wrote:
> Hi,
>
>>> Can you point me to instructions on the best way to replace a disk?
>>
>> First run "repair" on the array; hopefully it'll notice the unreadable
>> blocks and re-write them.
>>
>> echo repair >> /sys/block/md0/md/sync_action
>>
>> Also make sure your OS does regular scrubs of the RAID. Usually this is done
>> by monthly runs of checkarray; this is an example from Ubuntu:
>
> Great, thanks. I recalled something like that, but couldn't remember exactly.
>
> The system passed the above rebuild test on both arrays, but I'm
> obviously still concerned about the disk. Here are the relevant
> smartctl lines:
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   108   089   006    Pre-fail Always       -       0
>   3 Spin_Up_Time            0x0003   094   094   000    Pre-fail Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age  Always       -       29
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail Always       -       0
>   7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail Always       -       209739855
>   9 Power_On_Hours          0x0032   074   074   000    Old_age  Always       -       22816
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age  Always       -       37
> 187 Reported_Uncorrect      0x0032   095   095   000    Old_age  Always       -       5
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age  Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   075   064   045    Old_age  Always       -       25 (Min/Max 23/32)
> 194 Temperature_Celsius     0x0022   025   040   000    Old_age  Always       -       25 (0 18 0 0)
> 195 Hardware_ECC_Recovered  0x001a   057   045   000    Old_age  Always       -       51009302
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always       -       2
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline      -       2
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age  Always       -       0
> 200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age  Offline      -       0
> 202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age  Always       -       0
>
> Pending_sector and uncorrectable are both greater than zero. Is this
> drive on its way to failure?
>
> Can someone point me to the proper mdadm commands to set the drive
> faulty then rebuild it after installing the new one?
>
> Thanks again,
> Alex

187 Reported_Uncorrect      0x0032   095   095   000    Old_age  Always       -       5
197 Current_Pending_Sector  0x0012   100   100   000    Old_age  Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age  Offline      -       2

This tells me to get rid of the drive. I don't know the mdadm commands
off the top of my head, sorry, but they're in the man page(s). If you
want, you can run a scrub and see if these numbers change. If the drive
fails hard enough, md will kick it out of the array anyway. Btw, I scrub
my RAID6 (7 HDDs) once a week.

/Mathias

