Re: smart short test crashes software raid array?

Wols Lists <antlists@xxxxxxxxxxxxxxx> · Sun, 10 Mar 2019 15:10:44 +0000

On 10/03/19 11:14, Reindl Harald wrote:
> 
> On 10.03.19 at 10:55, Andy Smith wrote:
> 
>> On Sat, Mar 09, 2019 at 11:53:22PM +0100, Reindl Harald wrote:
>>> On 09.03.19 at 23:32, Wols Lists wrote:
>>>> Well, my first take on that is that they are NOT raid-quality drives!!!
>>>
>>> when i hear such shit i frankly could puke!
>>
>> I suspect that Wols means that these drive models cannot set SCTERC so
>> will retry for a very long time, requiring the block layer timeouts to
>> be set to 2+ minutes or else risk drive being kicked out by the kernel
>> whenever there is a minor problem.
>>
>> If so, this is a factual thing, in that manufacturers really did produce
>> drives that are sub-optimal for RAID.
> 
> no, the problem is that you need to change those timeouts because of bad
> defaults
> 
Which is why I said in my original email that you need to make sure the
timeout script runs ...

These drives *are* sub-optimal in that (a) they are unfit for raid use
"out of the box", and (b) they cannot be configured suitably for such use.

You have to muck about with the OS on *every* *boot*, because the kernel
timeout setting does not persist across reboots. And with the timeout
raised, any read problem makes the machine appear to hang, because the
drive takes something like two to three minutes to sort itself out. That
is painful on a desktop, and intolerable on a server where a process
hangs that long waiting for a read to complete.
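
For anyone who hasn't seen it, the per-boot fiddling amounts to roughly
the following. This is only a sketch in Python, not the script itself:
the helper names, the 7-second ERC value and the 180-second fallback are
my assumptions for illustration, and it assumes smartctl exits non-zero
when a drive refuses the SCT ERC setting. It needs root and
smartmontools installed.

```python
import subprocess
from pathlib import Path

ERC_DECISECONDS = 70       # 7.0 s; smartctl takes SCT ERC in units of 100 ms
FALLBACK_TIMEOUT_S = 180   # assumed value, in the "two to three minutes" range


def timeout_path(device: str) -> Path:
    """Map e.g. /dev/sda to its block-layer timeout file in sysfs."""
    return Path("/sys/block") / Path(device).name / "device" / "timeout"


def try_set_scterc(device: str) -> bool:
    """Ask the drive to give up on a bad sector after 7 seconds.

    Assumption: a non-zero exit status means the drive rejected the setting.
    """
    cmd = ["smartctl", "-l",
           f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}", device]
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return done.returncode == 0


def configure(device: str, set_erc=try_set_scterc, write=None) -> str:
    """Return "erc" if the drive accepted SCT ERC.

    Otherwise raise the kernel timeout for that drive and return "timeout".
    `set_erc` and `write` are injectable so the logic can be dry-run.
    """
    if set_erc(device):
        return "erc"
    path = timeout_path(device)
    if write is None:
        path.write_text(f"{FALLBACK_TIMEOUT_S}\n")  # needs root
    else:
        write(path, FALLBACK_TIMEOUT_S)
    return "timeout"
```

Run at boot, you would call configure("/dev/sda") and so on for each
member drive. A drive that comes back as "timeout" is one of the
sub-optimal ones: it stays in the array, but a bad sector can stall
reads for minutes.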

For a backup server where online access is not important, these drives
are okay. For systems where you have users expecting a fast response,
they are not.

I'd like to modify the raid layer so that it times out quickly and,
after a few seconds, recalculates the data and rewrites it, so that
these drives cease to be a problem. But stick that on the long list of
raid papercuts I'd like to sort out when I can find the time to learn to
program the raid subsystem!

Cheers,
Wol