Re: SSD based sw RAID: is ERC/TLER really important?

Phil Turmel <philip@xxxxxxxxxx> · Sat, 24 Jul 2021 17:45:37 -0400

On 7/24/21 4:19 PM, Peter Grandi wrote:

the recovery time in case of media errors could exceed kernel 
timeouts and possibly kick off the entire drive from the RAID set
and, in turn, lead to a fault of a RAID5 system upon a subsequent
error in a second drive.

My understanding seems different:

Your understanding is incorrect.

* The purpose of having a short device error retry period is the
opposite: it is to fail a drive as fast as possible, in workloads
where latency matters (or where there is also the risk of bus/link
resets hitting multiple drives). In those cases error retry periods
of 1-2 seconds (at most) are common, rather than the mid-way
"7 seconds" copied and pasted from web pages.

Yes, the short ERC setting helps latency, but the primary purpose is to
be shorter than the kernel timeout.
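
A minimal sketch of that comparison, assuming the drive's ERC limit is
expressed in tenths of a second (as smartctl reports it, so 70 means 7.0
seconds) and the kernel timeout is the per-device value in seconds under
/sys/block/<dev>/device/timeout. The function names are hypothetical, and
treating a disabled ERC as a mismatch is a conservative assumption, not
something measured:

```python
from pathlib import Path
from typing import Optional

KERNEL_DEFAULT_TIMEOUT_S = 30  # Linux default command timeout, in seconds


def read_kernel_timeout(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the kernel command timeout in seconds for e.g. device='sda'."""
    path = Path(sysfs_root, device, "device", "timeout")
    return int(path.read_text().strip())


def has_timeout_mismatch(erc_deciseconds: Optional[int],
                         kernel_timeout_s: int = KERNEL_DEFAULT_TIMEOUT_S) -> bool:
    """True when the drive may still be retrying when the kernel gives up.

    erc_deciseconds is the drive's ERC limit in tenths of a second, or
    None when ERC is disabled or unsupported. In the None case the retry
    time is unknown, so this sketch assumes the worst.
    """
    if erc_deciseconds is None:
        return True
    # Equal values are treated as a mismatch: the drive has no margin to
    # report its error before the kernel timer fires.
    return erc_deciseconds / 10 >= kernel_timeout_s
```

With the common ERC value of 70 and the default 30 second kernel timeout,
has_timeout_mismatch(70) is False, which is the safe configuration.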

* The purpose of having a long device error retry is instead to
minimize the chances of declaring a drive failed, hoping that many
retries succeed (but note the difference between reads and writes).

Read errors do *not* kick drives out.  It takes several read errors in a
short time to fail a drive out of an array.

A drive not responding before the kernel timeout *will* get kicked,
though.  The kernel giving up propagates to the raid as a read error
(while the drive is off in la-la land), which then causes the raid to
*reconstruct* the missing sector and *write* it back, along with
passing the reconstructed data up the chain.

That write will fail because the drive is still in la-la land.  Any
write failure *does* kick the drive out.
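
The sequence above can be sketched as a toy model.  This is only an
illustration of the reasoning, not how the md driver is implemented:
simulate_read and its event strings are hypothetical, and the model
assumes the rewrite is issued right after the read result comes back.

```python
from typing import List


def simulate_read(recovery_s: float, kernel_timeout_s: float = 30.0) -> List[str]:
    """Trace what happens to one bad-sector read on a raid member.

    recovery_s is how long the drive spends on internal error recovery
    before it answers; kernel_timeout_s is the kernel command timeout.
    """
    events = ["read issued to drive"]
    if recovery_s < kernel_timeout_s:
        # The drive gives up first and reports the error while still
        # responsive, so the rewrite of the sector can succeed.
        events += [
            "drive reports read error in time",
            "raid reconstructs sector and writes it back",
            "write succeeds, drive stays in the array",
        ]
    else:
        # The kernel gives up first; the drive is still busy retrying
        # when the reconstructed sector is written back.
        events += [
            "kernel timeout expires, raid sees a read error",
            "raid reconstructs sector and writes it back",
            "write fails because the drive is still busy",
            "drive kicked out of the array",
        ]
    return events
```

For example, simulate_read(7) ends with the drive staying in the array,
while simulate_read(120) ends with it kicked out.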

* It is possible to set the kernel timeouts higher than device retry
periods, if one does not care about latency, to minimize the chances
of declaring a drive failed (note the difference between Linux command
timeouts and retry timeouts, the latter can also be long).

But in the case of SSD drives (where, possibly, the error recovery
activities performed by the drive firmware are very fast) [...]

I guess that depends on the firmware: on one hand MLC cells can
become quite unreliable, especially at higher temperatures, requiring
many retries and lots of ECC; on the other hand, on "write" allocating
a new erase-block is easy because, unlike for most HDDs, with an FTL
the logical and physical sector locations of an SSD are independent.
Unfortunately most flash SSD drive makers don't supply technical
information on details like error recovery strategies.

I don't have data on SSD behavior without ERC.  If their retry cycle is 
exhausted within the kernel default 30 seconds, the timeout mismatch 
issue will *not* apply.

Phil