Re: [PATCH] scsi: Allow error handling timeout to be specified

Baruch Even <baruch@xxxxxxxxx> · Fri, 10 May 2013 16:22:19 +0300

On Fri, May 10, 2013 at 3:43 PM, Ewan Milne <emilne@xxxxxxxxxx> wrote:
>
> On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
> > Introduce eh_timeout which can be used for error handling purposes. This
> > was previously hardcoded to 10 seconds in the SCSI error handling
> > code. However, for some fast-fail scenarios it is necessary to be able
> > to tune this as it can take several iterations (bus device, target, bus,
> > controller) before we give up.
> >
> > Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
> >
>
> Thanks for posting this.  It will be very helpful to have this
> capability, particularly when alternate paths to the device exist.
>
> Acked-by: Ewan D. Milne <emilne@xxxxxxxxxx>

I would argue that waiting for the EH to time out before you switch to
another path is most likely wrong. If the first pass of error recovery
(task abort) failed, the path/HBA/logical device is doomed. If you
switch to another path, it will either work (meaning the path/HBA was
bad) or not (meaning the logical device was the culprit).

Actually, reducing the timeouts is probably not a good approach, since
it will cause the host to escalate to more radical measures without
waiting long enough for a potential recovery. In addition, the more
radical error handling steps, such as host reset, will destroy other
paths to completely unrelated devices/links. In my experience a host
reset is usually not required, and the Linux kernel currently reaches
for this big hammer too fast.
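
To put rough numbers on this, here is a toy model of the escalation
ladder. It is not kernel code. It assumes that each failed step costs
exactly one eh_timeout before the next, more drastic step is tried,
which simplifies what the real error handler does. The step list and
the seconds_until() helper are made up for illustration.

```python
# Toy model only, not kernel code. Assumption: every failed escalation
# step waits one full eh_timeout before the next step is attempted.
ESCALATION_STEPS = [
    "task abort",
    "device reset",
    "target reset",
    "bus reset",
    "host reset",
]


def seconds_until(step, eh_timeout):
    """Seconds spent waiting on failed earlier steps before `step` starts."""
    return ESCALATION_STEPS.index(step) * eh_timeout


if __name__ == "__main__":
    # Compare the old hardcoded 10 seconds with a hypothetical 2 seconds.
    for eh_timeout in (10, 2):
        waited = seconds_until("host reset", eh_timeout)
        print(f"eh_timeout={eh_timeout}s: host reset starts after {waited}s")
```

Under that assumption, dropping the timeout from 10 to 2 seconds brings
the host reset forward from 40 seconds to 8, so a device that needed
more than 8 seconds to recover gets a host reset anyway.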

Not that I have any qualms about the patch itself; I've been down this
path myself and was proven wrong by real life. That said, my experience
was mostly with SAS networks rather than FC.

Baruch