> the recovery time in case of media errors could exceed kernel
> timeouts and possibly kick off the entire drive from the RAID
> set and, in turn, lead to a fault of a RAID5 system upon a
> subsequent error in a second drive.

My understanding seems different:

* The purpose of a short device error retry period is the opposite: it is to fail a drive as fast as possible, in workloads where latency matters (or where there is also the risk of bus/link resets hitting multiple drives). In those cases error retry periods of 1-2 seconds (at most) are common, rather than the mid-way "7 seconds" copied and pasted from web pages.

* The purpose of a long device error retry period is instead to minimize the chances of declaring a drive failed, in the hope that many retries eventually succeed (but note the difference between reads and writes).

* It is possible to set the kernel timeouts higher than the device retry periods, if one does not care about latency, to minimize the chances of declaring a drive failed (note the difference between Linux command timeouts and device retry timeouts; the latter can also be long). A sketch of setting both on Linux follows at the end of this note.

> But in the case of SSD drives (where, possibly, the error
> recovery activities performed by the drive firmware are very
> fast) [...]

I guess that depends on the firmware: on one hand MLC cells can become quite unreliable, especially at higher temperatures, requiring many retries and lots of ECC; on the other hand, on a write, allocating a new erase-block is easy, as, unlike for most HDDs, with an FTL an SSD's logical and physical sector locations are independent. Unfortunately most flash SSD drive makers don't supply technical information on details like error recovery strategies.
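
As a concrete illustration of the third point above, here is a minimal sketch of setting both periods on Linux. It assumes smartmontools is installed, the script runs as root, and the drive supports SCT Error Recovery Control; the device name "sda" is just an example, not a recommendation.

    #!/usr/bin/env python3
    # A minimal sketch, assuming a Linux host with smartmontools
    # installed, root privileges, and a SATA drive that supports
    # SCT Error Recovery Control; "sda" is an example device name.
    import subprocess
    from pathlib import Path

    DEVICE = "sda"

    # Cap the drive's internal read/write error recovery at 7.0
    # seconds; smartctl takes the SCT ERC values in tenths of a
    # second, so 70 means 7.0s.
    subprocess.run(
        ["smartctl", "-l", "scterc,70,70", f"/dev/{DEVICE}"],
        check=True,
    )

    # Keep the kernel's per-command timeout (in seconds) well above
    # the drive's retry period, so the kernel does not give up on,
    # and fail, a command the drive is still trying to recover.
    Path(f"/sys/block/{DEVICE}/device/timeout").write_text("30\n")

For the opposite policy of long device retries, the same two knobs move the other way: "smartctl -l scterc,0,0" disables the ERC limit on drives that allow it, and the sysfs timeout would then be raised to a value (for example 180 seconds) comfortably above the drive's worst-case recovery time.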