Re: smart short test crashes software raid array?

Nix <nix@xxxxxxxxxxxxx> · Mon, 11 Mar 2019 12:31:33 +0000

On 10 Mar 2019, Wols Lists uttered the following:

> I'd like to modify the raid layer such that it times out quickly, and
> recalculates and rewrites the data after a few seconds, such that these
> drives cease to be a problem, but stick that on the long list of raid
> papercuts I'd like to sort out when I can find the time to learn to
> program the raid subsystem!

I don't see how that could work. When these drives get stuck on lengthy
retries, they are essentially unresponsive: often you can't even tell
them to abort the command (and on some old drives they even locked the
PCI bus up while they did it, though I haven't seen *that* for a very
long time and it's probably impossible on PCIe anyway). If the drive is
unresponsive while it retries, that means you can't ask it to write
other data instead: it's unresponsive! You'd need to schedule a rewrite
for later, somehow (not using the usual block-layer queueing, because if
the drive times out and gets reset I'm not sure what happens to the
contents of those queues).
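
To make the "schedule a rewrite for later" idea a bit more concrete, here
is a rough userspace sketch in Python. It is not md code and not a
proposal for how the kernel should do it: DeferredRewrites, defer(),
flush(), and the drive_ready/write callbacks are all names I made up for
illustration. The only point is that the pending rewrites live in a side
list of their own rather than in the normal I/O queue, and get replayed
once the drive answers again.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PendingRewrite:
    """One recalculated block waiting to be written back (hypothetical)."""
    sector: int
    data: bytes


@dataclass
class DeferredRewrites:
    """Hypothetical side list of rewrites, kept apart from the normal I/O queue."""
    pending: deque = field(default_factory=deque)

    def defer(self, sector: int, data: bytes) -> None:
        # Remember the rewrite instead of issuing it to an unresponsive drive.
        self.pending.append(PendingRewrite(sector, data))

    def flush(self, drive_ready, write) -> int:
        """Replay deferred rewrites in order while the drive stays responsive.

        drive_ready() and write(sector, data) are caller-supplied stand-ins
        for whatever the real layer would use. Returns the number issued.
        """
        issued = 0
        while self.pending and drive_ready():
            item = self.pending.popleft()
            write(item.sector, item.data)
            issued += 1
        return issued
```

Anything still in the list after a flush() simply waits for the next
attempt, so a reset of the drive in between doesn't lose it (which is the
bit I'm unsure the real queues would guarantee).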

-- 
NULL && (void)