Re: SSD based sw RAID: is ERC/TLER really important?

pg@xxxxxxxxxxxxxxxxxxxx (Peter Grandi) · Sun, 25 Jul 2021 12:28:17 +0200

>> * The purpose of having a long device error retry is instead to
>> minimize the chances of declaring a drive failed, hoping that many
>> retries succeed. (but note the difference between reads and writes).
>> * It is possible to set the kernel timeouts higher than device retry
>> periods, if one does not care about latency, to minimize the
>> chances of declaring a drive failed (not[e] the difference
>> between Linux command timeouts and retry timeouts, the latter
>> can also be long).

> Your understanding is incorrect.
> Read errors do *not* kick drives out. It takes several read
> errors in a short time to fail a drive out of an array.

I am sorry that I was not clear enough and therefore:

* You failed to understand the relevance of "note the difference
  between reads and writes" which I added precisely because I
  guessed that someone unfamiliar with storage devices would need
  that terse qualifier.

* You failed to understand the relevance of the "to minimize the
  chances of declaring a drive failed".

* You failed to realize that I was addressing tersely the
  original poster's case of a drive being declared failed
  because of a drive timeout longer than the kernel command
  timeout, without going into detail about all other possible
  cases.
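
To make that last case concrete, here is a minimal sketch of the comparison between the two timeouts. It assumes the kernel command timeout is read from /sys/block/<dev>/device/timeout (in seconds) and that the drive's error recovery limit is given in units of 100 ms, as "smartctl -l scterc" reports it. The function names and the three labels are hypothetical, chosen only for illustration, and the sketch says nothing about how MD itself reacts afterwards.

```python
from pathlib import Path


def kernel_timeout_seconds(dev, sysfs_root="/sys/block"):
    """Read the Linux command timeout for a SCSI/SATA block device, in seconds."""
    return int(Path(sysfs_root, dev, "device", "timeout").read_text().strip())


def timeout_risk(kernel_timeout_s, erc_deciseconds):
    """Classify which side gives up first on a hard-to-read sector.

    erc_deciseconds is the drive's error recovery limit in units of 100 ms;
    0 or None is taken to mean that no limit is set (an assumption of this
    sketch), so the drive may keep retrying for an unbounded time.
    """
    if not erc_deciseconds:
        # No device-side limit: the kernel timeout may expire mid-retry.
        return "unbounded"
    erc_seconds = erc_deciseconds / 10
    if erc_seconds >= kernel_timeout_s:
        # The kernel times the command out before the drive reports an error.
        return "kernel-first"
    # The drive reports the error before the kernel command timeout expires.
    return "drive-first"
```

For example, timeout_risk(kernel_timeout_seconds("sda"), 70) compares the current kernel timeout of a hypothetical device sda with a 7-second device limit. "kernel-first" and "unbounded" correspond to the original poster's situation, where the kernel gives up on the command while the drive is still retrying.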