Re: [RFC][PATCH] Introduce the parameter to limit scsi timeout count

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Mon, 01 Jun 2009 20:02:38 +0000

On Mon, 2009-06-01 at 15:15 -0400, Takahiro Yasui wrote:
> Hi,
> 
> I would like to solve an issue related to SCSI timeouts.
> 
> A storage device can fail in such a way that it does not respond to
> SCSI commands such as read/write, while it still responds successfully
> to SCSI commands such as Test Unit Ready.
> (This may depend on the storage implementation.)
> 
> When this type of device failure happens, the SCSI mid-layer
> detects a timeout for the device and tries to recover from the
> error. The mid-layer then concludes from the result of Test Unit
> Ready that the device has recovered.
> 
> Therefore, the state of the device is not changed to offline
> and user applications can continue to issue I/Os to the device.
> This may cause repeated timeout errors on the same device,
> and applications cannot take proper action quickly.
> 
> To solve this issue, let me propose a sysfs parameter to
> limit the SCSI timeout count in the SCSI mid-layer. This parameter
> is also tunable as a module parameter to address the issue at
> system boot.
> 
> * example
> 
>  - Limit the SCSI timeout count to 1
>     # echo 1 > /sys/block/<sdX>/device/max_timeout_cnt
> 
>  - Display the current timeout count
>     # cat /sys/block/<sdX>/device/iotimeout_cnt
> 
>  - Load scsi module with a default scsi timeout count (5)
>     # insmod scsi_mod.ko max_timeout_count=5
> 
> I appreciate your comments and suggestions.

It doesn't really look like a good solution to the problem you're
describing, particularly if it's just a few isolated arrays.

The code you propose would certainly catch things like USB devices, which
are known for random timeouts; plus a lot of SCSI/ATA devices suffer
isolated timeouts because of I/O load.  Global code like this could end
up offlining them.
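
As an illustration of that concern, here is a minimal user-space sketch in
C.  It is not the code from the patch: the struct and function names are
hypothetical, and it assumes the counter is cumulative, is never reset by
successful I/O, and offlines the device once it reaches the limit.  Only
the names iotimeout_cnt and max_timeout_cnt are taken from the sysfs files
in the proposal.

```c
#include <stdbool.h>

/* Hypothetical user-space model of a per-device timeout limit.
 * This is NOT the patch code; it only models the behaviour described. */
struct dev_model {
	unsigned int iotimeout_cnt;   /* timeouts seen so far */
	unsigned int max_timeout_cnt; /* limit set by the administrator */
	bool offline;                 /* true once the limit is reached */
};

/* Record one command timeout and report whether the device is offline.
 * Assumption: the counter is cumulative and never reset, so isolated
 * timeouts spread over a long period still add up. */
static bool model_record_timeout(struct dev_model *d)
{
	if (d->offline)
		return true;

	d->iotimeout_cnt++;
	if (d->iotimeout_cnt >= d->max_timeout_cnt)
		d->offline = true;

	return d->offline;
}
```

Under these assumptions, with the limit at 5, a device that suffers five
isolated timeouts over any length of time ends up offline, no matter how
much I/O completed successfully in between.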

Which arrays are these, and what's the taxonomy of the failure ... if
TUR succeeds, perhaps there's another command for the arrays we could
send that would fail or time out ... or perhaps there's a different way
they should be recovered.

James
