Re: [PATCH v2] scsi: Add 'retry_timeout' to avoid infinite command retry

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Thu, 06 Feb 2014 21:46:43 -0800

On Fri, 2014-02-07 at 09:22 +0900, Eiichi Tsukata wrote:
> Currently, scsi error handling in scsi_io_completion() unconditionally
> requeues a scsi command for as long as the device stays in an error state.
> For example, UNIT_ATTENTION causes infinite retry with
> action == ACTION_RETRY.
> This is because retryable errors are thought to be temporary and the scsi
> device will soon recover from those errors. Normally, such retry policy is
> appropriate because the device will soon recover from temporary error state.

> But there is no guarantee that the device is able to recover from the error
> state immediately. Actually, we've experienced an infinite retry on some
> hardware. Therefore a hardware error can result in an infinite command
> retry loop.

Could you please add an analysis of the actual failure: which devices
and under what conditions?

> This patch adds a 'retry_timeout' sysfs attribute which limits the retry
> time of each scsi command. This attribute is located in the scsi sysfs
> directory, for example "/sys/bus/scsi/devices/X:X:X:X/", and its value is
> in seconds. Once the retry time of a scsi command exceeds this timeout,
> the command is treated as a failure. 'retry_timeout' is set to '0' by
> default, which means no timeout is set.

Don't do this ... you're mixing a feature (which you'd need to justify)
with an apparent bug fix.

Once you dump all the complexity, I think the patch boils down to a
simple check before the action switch in scsi_io_completion():

	if (action != ACTION_FAIL &&
	    time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
		action = ACTION_FAIL;
		description = "command timed out";
	}
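
For anyone who wants to try the comparison outside the kernel, here is a
minimal userspace sketch of the same check. It is an illustration under
stated assumptions, not the patch itself: the time_after()/time_before()
macros are simplified stand-ins for the ones in <linux/jiffies.h> (no type
checking), the enum is a reduced stand-in for the action values used in
scsi_io_completion(), and retry_window_expired() and cap_retry() are
hypothetical helper names. The signed subtraction is what keeps the test
correct when the jiffies counter wraps.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel macros: a wraparound-safe
 * comparison of two unsigned tick counters via signed difference. */
#define time_after(a, b)  ((long)((b) - (a)) < 0)
#define time_before(a, b) time_after(b, a)

/* Reduced stand-in for the action values in scsi_io_completion(). */
enum action { ACTION_FAIL, ACTION_RETRY };

/* Hypothetical helper: true once 'now' has moved past the deadline
 * (allocation time plus the allowed wait), even across a wrap. */
static bool retry_window_expired(unsigned long jiffies_at_alloc,
				 unsigned long wait_for,
				 unsigned long now)
{
	return time_before(jiffies_at_alloc + wait_for, now);
}

/* Hypothetical helper: downgrade any non-failing action to
 * ACTION_FAIL once the retry window has expired. */
static enum action cap_retry(enum action action,
			     unsigned long jiffies_at_alloc,
			     unsigned long wait_for,
			     unsigned long now)
{
	if (action != ACTION_FAIL &&
	    retry_window_expired(jiffies_at_alloc, wait_for, now))
		action = ACTION_FAIL;
	return action;
}
```

With a command allocated at tick 1000 and wait_for of 300, the action stays
ACTION_RETRY up to and including tick 1300 and becomes ACTION_FAIL from tick
1301 on.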

James
