Re: smart short test crashes software raid array?

Wols Lists <antlists@xxxxxxxxxxxxxxx> · Tue, 12 Mar 2019 08:09:23 +0000

On 11/03/19 22:37, Adam Goryachev wrote:
> On 12/3/19 5:14 am, Wols Lists wrote:
>> On 11/03/19 12:31, Nix wrote:
>>> On 10 Mar 2019, Wols Lists uttered the following:
>>>
>>>>> I'd like to modify the raid layer such that it times out quickly, and
>>>>> recalculates and rewrites the data after a few seconds, such that these
>>>>> drives cease to be a problem, but stick that on the long list of raid
>>>>> papercuts I'd like to sort out when I can find the time to learn to
>>>>> program the raid subsystem!
>>> I don't see how that could work. When these drives get stuck on lengthy
>>> retries, they are essentially unresponsive:
>> So any code needs to take that into account. Pain in the arse, but when
>> the Linux read times out, the re-write code needs to detect that the
>> drive is one of these cheapos, and spawn a thread that waits for the
>> drive time-out before rewriting it.
>>
>> Of course, that's going to cause a host of other issues that will need
>> sorting/fixing :-) - the obvious one is what happens if something else
>> re-writes that block in the middle of the time-out period ...
>>
>> Cheers,
>> Wol
> 
> Doesn't this happen already? The drive will either return the data (if
> it magically succeeds in reading the requested data in that 180?
> seconds), or it will return a read error.

But that's the whole point - THAT IS UNACCEPTABLE.

What I would like to make happen is that

1) Linux issues a read request ...

... the drive hits a read error, so

2) Linux times out after 7 seconds

3) The raid code computes the missing block and passes it back to the user

4) The raid code spots that the disk has a 180s timeout *so it waits*

5) The block is rewritten.
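
To make the ordering of those five steps concrete, here is a rough toy
simulation in Python. It is only a sketch of the idea, not md code: the
names (ToyDisk, ToyArray, flush_pending) are made up for illustration, time
is a plain number passed in by the caller, the array is assumed to be simple
single-parity XOR, and it is assumed that rewriting a bad block makes it
readable again. It deliberately ignores the case where something else
writes to the block while the rewrite is waiting.

```python
from dataclasses import dataclass, field
from functools import reduce

LINUX_READ_TIMEOUT_S = 7      # step 2: when Linux gives up on the read
DRIVE_RETRY_TIMEOUT_S = 180   # step 4: how long the drive stays unresponsive


class ReadTimeout(Exception):
    """Raised when the toy drive does not answer within the Linux timeout."""


def xor_blocks(blocks):
    """XOR equal-length byte strings together (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


@dataclass
class ToyDisk:
    blocks: dict                            # block number -> bytes
    bad: set = field(default_factory=set)   # blocks the drive cannot read
    stuck_until: float = 0.0                # simulated time the drive is busy until

    def read(self, n, now):
        if n in self.bad:
            # The drive goes off into its long internal retry loop.
            self.stuck_until = now + DRIVE_RETRY_TIMEOUT_S
            raise ReadTimeout(n)
        return self.blocks[n]

    def write(self, n, data, now):
        if now < self.stuck_until:
            raise RuntimeError("drive is still busy with internal retries")
        self.blocks[n] = data
        self.bad.discard(n)   # assumption: a rewrite makes the block readable


class ToyArray:
    def __init__(self, disks):
        self.disks = disks
        self.pending = []   # deferred rewrites: (due_time, disk_index, block, data)

    def read(self, disk_index, n, now):
        """Return (data, time_answered) for block n of one member disk."""
        try:
            return self.disks[disk_index].read(n, now), now       # step 1
        except ReadTimeout:
            answered = now + LINUX_READ_TIMEOUT_S                  # step 2
            others = [d.read(n, answered)
                      for i, d in enumerate(self.disks) if i != disk_index]
            data = xor_blocks(others)                              # step 3
            due = now + DRIVE_RETRY_TIMEOUT_S                      # step 4
            self.pending.append((due, disk_index, n, data))
            return data, answered

    def flush_pending(self, now):
        """Step 5: rewrite every block whose drive should be responsive again."""
        still_waiting = []
        for due, disk_index, n, data in self.pending:
            if now >= due:
                self.disks[disk_index].write(n, data, now)
            else:
                still_waiting.append((due, disk_index, n, data))
        self.pending = still_waiting
```

The point of the toy is that the caller gets its data back at the 7 second
mark, while the rewrite sits in a queue until the 180 seconds are up.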

You're missing the point: that 180s wait really f***s things up for
people, and they often don't realise there's a problem until they hit it.

My solution is a very good fix apart from the fact that step 4 is a pile
of spaghetti waiting to cause havoc ... :-)

Cheers,
Wol