Re: [PATCH] scsi: Allow error handling timeout to be specified

On 5/13/2013 12:46 AM, Hannes Reinecke wrote:

> True. But and the end of the day, we _do_ want to recover the failed LUN.
> If we were to disable that faulty LUN and continue running with the others
> we won't have a chance of _ever_ recovering that one LUN.

	I don't buy this. Especially for FC devices, the vast majority of errors I see
are related to zoning, SFP and cabling problems. Once one of those happens you
tend to get a lot of shotgun debugging, which injects all kinds of
further errors. None of these errors are fixed by the Linux error recovery paths.

	That said, if the admin fixes something, for FC/SAS (and potentially others)
you _WILL_ get notification that the device is online again.


> SET when the link is down). So we basically _have_ to escalate it to the
> next level. Even though that will mean to stop I/O to other, hitherto
> unaffected instances.

	And a single failure turns into performance bubbles and further errors on
other devices, particularly if the functional devices are stateful and the
error recovery mechanism isn't sufficiently intelligent about that state (see
tape drives). Think about what happens when a marginal SFP on a target causes
a device to repeatedly drop off and reappear at some random point in the future.


	Anyway, it is possible to make a determination about the topology and make
decisions about the likelihood of any given portion being at fault. For
example, if one LUN on a target has failed and the remainder continue to work,
then it's unlikely that, if abort and LUN reset fail, anything higher up in
the stack is going to succeed.

	I feel pretty strongly that at that point you're better off providing good
diagnostics about the failure and expecting user interaction rather than
muddying the waters by causing other device interruptions. If the user tries
everything and determines that an HBA reset is the right choice, provide that
option; don't do it for them.

	If every device attached to the HBA fails then resetting the HBA is a valid
choice, not before. Same for I_T.
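
	To make the policy above concrete, here is a rough sketch of the decision,
written in Python for brevity rather than as kernel code. The names (Level,
max_escalation, the topology and failed structures) are made up for
illustration and do not correspond to anything in the SCSI midlayer. It only
shows the idea of capping escalation by what else on the topology still works.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical escalation levels, ordered by how much they disturb."""
    LUN_RESET = 1      # affects one LUN
    TARGET_RESET = 2   # affects every LUN behind one I_T nexus
    HOST_RESET = 3     # affects every device on the HBA


def max_escalation(topology, failed, target):
    """Return the widest reset this policy permits for a failure on `target`.

    topology: dict mapping target id -> set of LUN ids, for a single HBA
    failed:   set of (target, lun) pairs that are currently failing
    """
    # Some LUN on this target still works: do not go beyond LUN reset.
    if any((target, lun) not in failed for lun in topology[target]):
        return Level.LUN_RESET
    # The whole target is down, but another device on the HBA still works.
    if any((t, lun) not in failed
           for t, luns in topology.items() for lun in luns):
        return Level.TARGET_RESET
    # Every device attached to the HBA has failed.
    return Level.HOST_RESET
```

	If abort and LUN reset both fail while the function still returns
LUN_RESET, the idea is to stop there, report diagnostics, and leave any wider
reset to the admin.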


