On Mon, May 13, 2013 at 6:58 PM, Jeremy Linton <jlinton@xxxxxxxxxxxxx> wrote:
> On 5/13/2013 10:03 AM, Hannes Reinecke wrote:
>> The other LUNs haven't reported an error. But how do you know whether they
>> are still okay? The other LUNs might simply be idle, and no commands have
>> been send to them.
>
> Well, how about generating std inquiry against them if they are idle and the
> given HBA has a device in error state? Then you can make a rough approximation
> of what has failed, and escalate the error handling if all the devices at a
> particular level have failed.
>
> The midlayer may not even need to send the inquiries. If the individual
> device drivers (sd/st/etc) are responsible for monitoring and error recovery
> then they can be tasked with determining device availability as well. I think
> this solves other problems too. For example, the use of TUR in the midlayer,
> is a problem because it doesn't have enough knowledge about the possible check
> conditions being returned to act on them appropriately.

IMO such an approach is preferable to the big hammer, especially in the
likely case of multipath, where other links over the same host still have
traffic flowing through them. If there is already traffic on the same host,
there is no reason to do a host reset. If there is no traffic and there are
no other LUNs, go for the big gun; it will not matter to anything else. If
there are other inactive LUNs, some mechanism to trigger basic traffic
(INQUIRY/TUR) on them is much preferable to plain application of the big
hammer.

It might be that the kernel is not the right place for all of this
diagnostic work, but then some interface that lets an external daemon do the
diagnostics is preferable to just wielding the big hammer and killing all
traffic.
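To make the policy above concrete, here is a rough sketch of the decision in Python. It is only an illustration, not existing midlayer code: the `Lun` record, the `has_recent_traffic` flag, the `Action` names and the `probe` callback (standing in for an INQUIRY or TUR) are all made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Action(Enum):
    # Hypothetical outcomes, named for this sketch only.
    SKIP_HOST_RESET = "skip_host_reset"
    PROBE_IDLE_LUNS = "probe_idle_luns"
    HOST_RESET = "host_reset"


@dataclass(frozen=True)
class Lun:
    name: str
    has_recent_traffic: bool  # assumption: I/O completed on this LUN recently


def choose_action(other_luns: Iterable[Lun]) -> Action:
    """Decide what to do about the host once one device is in error state.

    other_luns: the remaining LUNs behind the same host, excluding the
    failed device.
    """
    luns = list(other_luns)
    # Traffic is still flowing through the host: the host is fine.
    if any(lun.has_recent_traffic for lun in luns):
        return Action.SKIP_HOST_RESET
    # Nothing else lives behind this host: a reset hurts nobody.
    if not luns:
        return Action.HOST_RESET
    # Only idle LUNs remain: generate some basic traffic before deciding.
    return Action.PROBE_IDLE_LUNS


def escalate_after_probe(idle_luns: Iterable[Lun],
                         probe: Callable[[Lun], bool]) -> Action:
    """Escalate to a host reset only if every probed LUN failed to answer.

    probe: hypothetical callback that returns True when the LUN responded.
    """
    if any(probe(lun) for lun in idle_luns):
        return Action.SKIP_HOST_RESET
    return Action.HOST_RESET
```

Whether the probing happens in the midlayer, in sd/st, or in an external daemon does not change the shape of the decision; only who supplies `probe` differs.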

In my experience, if the device doesn't respond it has usually just
disappeared from the network. If it is still on the network and the task
abort or target reset does not return successfully, then either the host
reset is unlikely to help (the host is fine, the device is gone), or the
host reset is the only way out because the host is dead; but in that case a
simple check on all the other LUNs would reveal that quite fast. In many
cases the host controller itself is mostly dead and the driver could detect
that on its own without waiting for the traffic to time out, but that's an
issue for each driver to handle.

Baruch