Re: [PATCH] raid456: avoid second retry of read-error

Wols Lists <antlists@xxxxxxxxxxxxxxx> · Wed, 6 Nov 2019 16:52:01 +0000

On 05/11/19 22:46, Nigel Croxon wrote:
> 
> On 11/4/19 7:33 PM, Wols Lists wrote:
>> On 04/11/19 20:01, Nigel Croxon wrote:
>>> The MD driver for level-456 should prevent re-reading read errors.
>>>
>>> For redundant raid it makes no sense to retry the operation:
>>> When one of the disks in the array hits a read error, that will
>>> cause a stall for the reading process:
>>> - either the read succeeds (e.g. after 4 seconds the HDD error
>>> strategy could read the sector)
>>> - or it fails after an HDD-imposed timeout (w/TLER, e.g. after 7
>>> seconds, might be even longer)
>> Okay, I'm being completely naive here, but what is going on? Are you
>> saying that if we hit a read error, we just carry on, ignore it, and
>> calculate the missing block from parity?
>>
>> If so, what happens if we hit two errors on a raid-5, or 3 on a raid-6,
>> or whatever ... :-)
>>
>> Cheers,
>> Wol
> 
> This allows the device (disk) to fail faster.  All logic is the same.
> 
> If there is a read error, it does not retry that read, it calculates
> the data from the other disks.  This patch removes the retry.
> 
Ummm ...

I suspect there is a very good reason for the retry ...

Bear in mind I don't actually KNOW anything, you'll need to check with
someone who knows about these things, but I get the impression that
transient errors aren't that uncommon. It fails, you try again, it succeeds.

So if you're going to go down that route, by all means re-calculate from
parity if ONE read fails, but if enough reads fail that the raid itself
would fail, you need to retry those reads because there is a good chance
they will succeed the second time round.
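
To make that concrete, here is a rough Python sketch of the policy I
mean. It is NOT the md code and I'm not claiming md works this way: the
names read_block, read_stripe and xor_blocks are made up, it assumes a
single-parity (raid-5 style) stripe where the XOR of all blocks is zero,
and it assumes one extra round of retries is enough.

```python
from functools import reduce
from typing import Callable, List, Optional

# Hypothetical reader: returns the block for a device, or None on a read error.
ReadFn = Callable[[int], Optional[bytes]]


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(read_block: ReadFn, n_devices: int, retries: int = 1) -> List[bytes]:
    """Read one stripe (data blocks plus one parity block).

    One failed read: no retry, rebuild it from the surviving blocks.
    More than one failed read: parity alone cannot cover it, so retry
    the failed reads in case the errors were transient.
    """
    blocks = [read_block(i) for i in range(n_devices)]
    failed = [i for i, b in enumerate(blocks) if b is None]

    attempts = 0
    while len(failed) > 1 and attempts < retries:
        for i in failed:
            blocks[i] = read_block(i)  # second chance for a transient error
        failed = [i for i, b in enumerate(blocks) if b is None]
        attempts += 1

    if len(failed) > 1:
        raise OSError(f"stripe unreadable, devices {failed} still failing")

    if failed:
        # Single parity: the missing block is the XOR of all the others.
        survivors = [b for b in blocks if b is not None]
        blocks[failed[0]] = xor_blocks(survivors)
    return blocks
```

With one bad device this never re-reads it, which is what the patch is
after; the retry only kicks in when the alternative is losing the stripe.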

Cheers,
Wol