Re: SSD based sw RAID: is ERC/TLER really important?

pg@xxxxxxxxxxxxxxxxxxx (Peter Grandi) · Sat, 24 Jul 2021 22:19:02 +0200

> the recovery time in case of media errors could exceed kernel
> timeouts and possibly kick off the entire drive from the RAID
> set and, in turn, lead to a fault of a RAID5 system upon a
> subsequent error in a second drive.

My understanding seems different:

* The purpose of having a short device error retry period is the
  opposite: it is to fail a drive as fast as possible, in
  workloads where latency matters (or where there is also the
  risk of bus/link resets hitting multiple drives). In those cases
  error retry periods of 1-2 seconds (at most) are common, rather
  than the mid-way "7 seconds" copied and pasted from web pages.

* The purpose of having a long device error retry period is
  instead to minimize the chances of declaring a drive failed,
  hoping that one of many retries succeeds (but note the
  difference between reads and writes).

* It is possible to set the kernel timeouts higher than device
  retry periods, if one does not care about latency, to minimize
  the chances of declaring a drive failed (note the difference
  between Linux command timeouts and retry timeouts, the latter
  can also be long).
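
To make the last point concrete, here is a minimal sketch (not a
recommendation) of how one might compare the two values. It assumes
the Linux SCSI command timeout is read from
/sys/block/<dev>/device/timeout (in seconds) and that the device
retry limit is the SCT ERC value as reported by "smartctl -l scterc"
(in units of 100 ms, with 0 meaning disabled). The function names
and the 5 second margin are hypothetical, chosen only for
illustration.

```python
from pathlib import Path


def read_kernel_timeout(dev: str) -> int:
    """Return the SCSI command timeout in seconds for a device such as 'sda'."""
    return int(Path(f"/sys/block/{dev}/device/timeout").read_text().strip())


def timeout_margin(kernel_timeout_s: float, erc_deciseconds: int) -> float:
    """Seconds by which the kernel timeout exceeds the device retry limit.

    SCT ERC values are expressed in units of 100 ms, so 70 means 7.0 s.
    """
    return kernel_timeout_s - erc_deciseconds / 10.0


def classify(kernel_timeout_s: float, erc_deciseconds: int,
             min_margin_s: float = 5.0) -> str:
    """Classify a timeout pair; min_margin_s is an arbitrary illustrative margin."""
    if erc_deciseconds == 0:
        # ERC disabled: the device may retry for an unbounded time,
        # so no fixed kernel timeout is guaranteed to be longer.
        return "erc-disabled"
    if timeout_margin(kernel_timeout_s, erc_deciseconds) >= min_margin_s:
        return "ok"
    return "too-tight"


if __name__ == "__main__":
    # Example values only: 30 s kernel timeout, ERC of 7.0 s.
    print(classify(30, 70))
```

With a 30 second command timeout and an ERC of 70 (7.0 s) the margin
is 23 seconds; with ERC disabled the only option left is to raise the
kernel timeout and accept the latency.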

> But in the case of SSD drives (where, possibly, the error
> recovery activities performed by the drive firmware are very
> fast) [...]

I guess that depends on the firmware: on one hand MLC cells can
become quite unreliable, especially at higher temperatures,
requiring many retries and lots of ECC; on the other hand, on
"write" allocating a new erase-block is easy because, unlike for
most HDDs, with an FTL the logical and physical sector locations
of an SSD are independent. Unfortunately most flash SSD makers
don't supply technical information on details like error recovery
strategies.