Re: [PATCH] scsi: Allow error handling timeout to be specified

Baruch Even <baruch@xxxxxxxxx> · Fri, 10 May 2013 20:55:35 +0300

On Fri, May 10, 2013 at 5:53 PM, Martin K. Petersen
<martin.petersen@xxxxxxxxxx> wrote:
>>>>>> "Baruch" == Baruch Even <baruch@xxxxxxxxx> writes:
>
> Baruch> Actually reducing the timeouts is probably not a good approach
> Baruch> since it will cause the host to take a more radical approach
> Baruch> without waiting sufficiently for a potential recovery.
>
> Reducing the eh timeout is a requirement in many clustered setups. We've
> been shipping a predecessor to this patch in our kernels for a long
> time.

> Baruch> In addition the more radical error handlings such as host reset
> Baruch> will destroy other paths for completely unrelated devices/links,
> Baruch> from my experience a host reset is usually not required and the
> Baruch> Linux kernel currently reaches to this big hammer too fast.
>
> I'm also working on a patch to add some heuristics to avoid the HBA and
> bus resets if I/O is completing successfully on other attached
> targets. But that's an orthogonal issue.

Why?

In my experience (again, SAS-based, inside a storage device), a
reduced eh timeout is more likely to escalate the problem than to
resolve it.

I actually find that the higher level should have a short timeout of
its own for its own recovery work, which normally entails going to
other copies of the data where available while letting the device try
to complete the IO if possible. I'm not sure how applicable this is to
the kernel itself, but I do feel it could be relevant.
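
A minimal user-space sketch of that idea, in Python rather than kernel
code. It assumes the upper layer can read the same data from a second
copy; read_with_fallback, simulated_read and soft_timeout are
hypothetical names used only for illustration, not existing kernel or
library interfaces.

```python
import concurrent.futures
import time


def read_with_fallback(pool, primary_read, replica_read, soft_timeout):
    """Read from the primary copy, falling back to a replica on a soft timeout.

    Returns (source, data, pending), where pending is the future for the
    primary read. On a soft timeout the primary read is NOT cancelled, so
    the device still gets the chance to complete the IO on its own.
    """
    pending = pool.submit(primary_read)
    try:
        # Short upper-layer timeout, independent of any device-level timeout.
        return "primary", pending.result(timeout=soft_timeout), pending
    except concurrent.futures.TimeoutError:
        # Serve the caller from another copy; leave the primary read running.
        return "replica", replica_read(), pending


def simulated_read(delay, payload):
    """Build a fake read that takes `delay` seconds (stand-in for a device)."""
    def read():
        time.sleep(delay)
        return payload
    return read
```

The point of the design is that the caller gets its data quickly from
another copy, while nothing forces an abort or reset on the slow
device; the caller can still inspect the pending read later to see
whether the device recovered.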

Baruch