Re: What to do about Offline_Uncorrectable and Pending_Sector in RAID1

Phil Turmel <philip@xxxxxxxxxx> · Mon, 14 Nov 2016 11:10:17 -0500

On 11/14/2016 11:03 AM, Bruce Merry wrote:
> On 14 November 2016 at 17:58, Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
>> On 14/11/16 15:52, Bruce Merry wrote:
>>> On 13 November 2016 at 23:06, Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
>>>>> Sounds like that drive could need replacing. I'd get a new drive and do
>>>>> that as soon as possible - use the --replace option of mdadm - don't
>>>>> fail the old drive and add the new.
>>> Would you mind explaining why I should use --replace instead of taking
>>> out the suspect drive? I guess I lose redundancy for any writes that
>>> occur while the rebuild is happening, but I'd plan to do this with the
>>> filesystem unmounted so there wouldn't be any writes.
>>
>> Because a replace will copy from the old drive to the new, recovering
>> any failures from the rest of the array. A fail-and-add will have to
>> rebuild the entire new array from what's left of the old, stressing the
>> old array much more.

I entirely endorse Anthony's advice on this one.  You are at great risk
of not completing a fail/add resync with the new drive.

> Okay, I can see how for RAID5 that might be a bad thing.
> 
> In my case however, it sounds like --replace will copy everything from
> the failing drive, whereas I'd rather it copied everything from the
> good drive. Same stress on the array, less chance of copying dodgy
> data.

You simply don't have that choice, sorry.  And drives returning dodgy
data is ungodly rare.  The sector checksum algorithms are that good.
You have a URE crisis in your array that is far more significant.

Phil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html