Re: [PATCH] scsi: Allow error handling timeout to be specified

Baruch Even <baruch@xxxxxxxxx> · Mon, 13 May 2013 19:50:42 +0300

On Mon, May 13, 2013 at 6:58 PM, Jeremy Linton <jlinton@xxxxxxxxxxxxx> wrote:
> On 5/13/2013 10:03 AM, Hannes Reinecke wrote:
>> The other LUNs haven't reported an error. But how do you know whether they
>> are still okay? The other LUNs might simply be idle, and no commands have
>> been sent to them.
>
>         Well, how about generating std inquiry against them if they are idle and the
> given HBA has a device in error state? Then you can make a rough approximation
> of what has failed, and escalate the error handling if all the devices at a
> particular level have failed.
>
>         The midlayer may not even need to send the inquiries. If the individual
> device drivers (sd/st/etc) are responsible for monitoring and error recovery
> then they can be tasked with determining device availability as well. I think
> this solves other problems too. For example, the use of TUR in the midlayer,
> is a problem because it doesn't have enough knowledge about the possible check
> conditions being returned to act on them appropriately.

Such an approach is preferable IMO to the big hammer, especially in the
likely case of multipath, where other links over the same host do have
traffic flowing through them. If there is already traffic on the same
host, there is no reason to do a host reset. If there is no traffic and
there are no other LUNs, go for the big gun; it will not matter to
anything else. If there are other inactive LUNs, some mechanism to
trigger basic traffic (INQUIRY/TUR) on them is much preferable to a
plain big-hammer application.
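
To make the three cases concrete, here is a rough sketch of the decision
only. It is not midlayer code: lun_state, eh_action and choose_eh_action()
are names I made up for illustration, and how a LUN gets classified as
active, idle or failed is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-LUN view for one host; not a kernel structure. */
enum lun_state {
    LUN_FAILED,     /* abort / target reset did not succeed */
    LUN_ACTIVE_OK,  /* commands are completing normally */
    LUN_IDLE        /* no recent traffic, health unknown */
};

/* Hypothetical outcomes of the escalation decision. */
enum eh_action {
    EH_SKIP_HOST_RESET,  /* host is evidently alive, leave it alone */
    EH_PROBE_IDLE_LUNS,  /* send INQUIRY/TUR to idle LUNs, then decide again */
    EH_HOST_RESET        /* nothing healthy would be disturbed */
};

/*
 * Decide how far to escalate for one host, given the state of every
 * LUN behind it (including the one that failed).
 */
enum eh_action choose_eh_action(const enum lun_state *luns, size_t n)
{
    bool any_idle = false;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Working traffic on the same host: a host reset is pointless. */
        if (luns[i] == LUN_ACTIVE_OK)
            return EH_SKIP_HOST_RESET;
        if (luns[i] == LUN_IDLE)
            any_idle = true;
    }

    /* Idle LUNs tell us nothing yet, so poke them before escalating. */
    if (any_idle)
        return EH_PROBE_IDLE_LUNS;

    /* Only failed LUNs (or none besides the failed one) remain. */
    return EH_HOST_RESET;
}
```

After a probe, each idle LUN would be reclassified as LUN_ACTIVE_OK or
LUN_FAILED and the function called again, so the host reset is reached
only when nothing behind the host answers.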

It might be that the kernel is not the right place for all of this
diagnostic work, but then some interface for an external daemon to do
the diagnostics is preferable to just wielding the big hammer and
killing all traffic.

In my experience, if the device doesn't respond it has usually just
disappeared from the network. If it is on the network and the task
abort or target reset does not return successfully, then either the
host reset is unlikely to help (the host is fine, the device is gone)
or the host reset is the only way out since the host is dead, but then
a simple check on all other LUNs would reveal that quite fast. In many
cases the host controller itself is mostly dead and the driver could
detect that on its own without waiting for the traffic to time out, but
that's an issue for each driver to handle.

Baruch