Re: SCSI error handling -- one error blocks the whole SCSI host

Baruch Even <baruch@xxxxxxxxx> · Tue, 28 May 2013 19:22:41 +0300

On Tue, May 28, 2013 at 5:38 PM, Jeremy Linton <jlinton@xxxxxxxxxxxxx> wrote:
> This is another part of what formed my opinions about error isolation. If one
> of your devices goes out to lunch and isn't recovering via abort/LUN reset,
> it's done! Wrecking the rest of the SAN doing "bus resets" and HBA resets is a
> good way to take a serious problem and turn it into a full-blown catastrophe.

This is the gist of the issue: once you get to an abort, you are already screwed.
You need the abort, but anything beyond it should be reserved for when things are
really dead (the HBA might still recover on a host reset, but only do that if the
host is really unresponsive).

That's why I prefer a long timeout for the command and a long timeout for the
abort. The application above should handle this with its own timeout once the
abort has been sent (the buffer remains locked until the abort returns). The
device itself is likely stuck in error recovery, and it will come out of it when
its own internal timeouts are exhausted; those can be infinite and will generally
be very large.
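
As a rough illustration of the "long command timeout" part, here is a minimal
sketch that raises the per-device command timeout through sysfs. It assumes the
usual /sys/block/<dev>/device/timeout attribute (value in seconds); the helper
name set_command_timeout and the sysfs_root parameter are made up for this
example, and the 300 second value is only a placeholder, not a recommendation.

```python
from pathlib import Path


def set_command_timeout(device: str, seconds: int,
                        sysfs_root: str = "/sys/block") -> Path:
    """Write a per-device SCSI command timeout (in seconds) to sysfs.

    Hypothetical helper: the attribute path below is an assumption about
    the sysfs layout, and sysfs_root exists so the function can be pointed
    at a scratch directory instead of the real /sys.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    attr = Path(sysfs_root) / device / "device" / "timeout"
    # The attribute takes a plain decimal number of seconds.
    attr.write_text(f"{seconds}\n")
    return attr


if __name__ == "__main__":
    # Example: give sda five minutes before the command times out.
    # Needs root when run against the real /sys/block.
    set_command_timeout("sda", 300)
```

This only covers the command timeout; the abort timeout is not addressed here.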

Baruch