Increasing SCSI command timeout vs implementing ERC for drives performing deep error recovery in an array

"." <desire@xxxxxxxxx> · Wed, 18 Apr 2012 19:05:04 +0800

I'm trying to understand what causes a drive performing a deep
recovery cycle to get kicked out of a Linux software RAID array, and
whether setting an ERC (error recovery control) limit is the only
option, or whether increasing the SCSI command timeout is a reasonable
alternative.

As I understand it, if the deep recovery goes on for long enough, the
SCSI command timeout would be exceeded. This causes the SCSI error
handler to attempt to abort the command and reset the device/bus/host.
If these error handlers fail, the drive is set offline (which I assume
is what kicks the drive out).

ERC helps in this scenario as the drive will return an error before
the timeout is exceeded.  The SCSI layer will return an error to the
md/raid layer, which can take the appropriate action (retry operation
/ recover data from redundant source and rewrite it / kick disk or
whatever).
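
To make the ERC option concrete, here is a minimal sketch of how I
understand the limit would be set, assuming smartmontools is installed
and the drive supports SCT ERC.  The function names and the 7.0 second
value are my own illustrative choices, and /dev/sdX is a placeholder.

```python
import subprocess


def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl argv that sets the SCT ERC read/write limits.

    Limits are given in units of 100 ms, so 70 means 7.0 seconds.
    The 70/70 default is an illustrative assumption, not a recommendation.
    """
    if read_ds < 0 or write_ds < 0:
        raise ValueError("ERC limits must be non-negative")
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


def set_erc(device, read_ds=70, write_ds=70):
    """Run smartctl (needs root) and return its exit status.

    smartctl's exit status is a bit mask, so a nonzero value is not
    always a failure to set the limit; the caller should inspect it.
    """
    return subprocess.run(scterc_command(device, read_ds, write_ds)).returncode
```

Calling scterc_command("/dev/sdX") only builds the argument list, so it
can be inspected before anything is run against a real drive.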

I have also read that the SCSI command timeout can be tuned via
/sys/block/.../device/timeout, and defaults to 30 seconds.  Would
raising this timeout to a large value likewise prevent deep recovery
cycles from causing the SCSI layer to set the drive offline?  Does
anyone know the maximum time a deep recovery cycle can take?
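
For concreteness, a minimal sketch of reading and raising that timeout
through sysfs.  The helper names and the sysfs_root parameter are my
own (hypothetical), added so the code can be tried against a scratch
directory; writing to the real file needs root.

```python
from pathlib import Path


def timeout_path(disk, sysfs_root="/sys"):
    """Path of the per-device SCSI command timeout file for a disk like 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "timeout"


def get_timeout(disk, sysfs_root="/sys"):
    """Return the current command timeout in seconds."""
    return int(timeout_path(disk, sysfs_root).read_text().strip())


def set_timeout(disk, seconds, sysfs_root="/sys"):
    """Write a new command timeout in seconds; the value must be positive."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(disk, sysfs_root).write_text(f"{seconds}\n")
```

For example, set_timeout("sdX", 180) would raise the limit to three
minutes for that disk; the value is a sysfs setting, so it would need
to be reapplied after a reboot.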

Or, might it be a situation where there will be lots of commands
queued behind the access to the bad sector, and increasing the SCSI
command timeout would only help with the first command, and the rest
of the queued commands will be exponentially delayed such that it is
not feasible to avoid this by increasing the timeout value?
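
To put numbers on that worry, here is a toy model.  It assumes (and
these are my assumptions, not confirmed behaviour) that the drive
services outstanding commands strictly in order and that every
command's timer starts at the same moment.  Real drives may reorder
queued commands, so this is only a rough sketch.

```python
from itertools import accumulate


def completion_times(service_times):
    """Completion time of each command when serviced strictly in order."""
    return list(accumulate(service_times))


def timed_out(service_times, timeout):
    """Indices of commands that finish after the timeout.

    Assumes every command's timer starts at t=0 (toy-model assumption).
    """
    done = completion_times(service_times)
    return [i for i, t in enumerate(done) if t > timeout]


# One 60 s recovery followed by three 1 s commands, 30 s timeout:
print(timed_out([60, 1, 1, 1], 30))   # [0, 1, 2, 3]
# Two 60 s recoveries in the queue, 90 s timeout:
print(timed_out([60, 1, 60], 90))     # [2]
```

Under these assumptions the delay behind a bad sector adds up linearly
with each further recovery in the queue rather than exponentially, but
a timeout sized for a single recovery can still be exceeded when more
than one queued command hits a bad sector.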

Appreciate your comments and corrections if I've made mistaken
assumptions above.  Thanks!