Re: SCSI error handling -- one error blocks the whole SCSI host

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Sun, 26 May 2013 15:44:02 -0700

On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> At LSF this year, we had a discussion about error handling and in
> particular the problem that SCSI midlayer error handling waits for the
> entire SCSI host (HBA) to quiesce before it starts to abort commands
> etc.
> 
> James made the suggestion that FC should handle things the way SAS
> does, because SAS has a strategy handler that does things the right
> way.  However, now that I finally sit down and look at the code, I
> don't see how this is the case.  It seems inherent in the way that
> scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
> particular the strategy handler can't even be called until host_failed
> == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
> which stops queueing commands to any devices attached to the whole
> HBA).
> 
> James, am I understanding your suggestion properly?  If so can you
> explain what you meant about the libsas code -- I see that it has its
> own strategy handler but as I said before we've already stopped every
> device attached to the HBA before we ever get there.

You are, but I checked: apparently it's not implemented in the SAS
transport class.  The original discussion when libsas was constructed,
as I remember it, was about using the scsi timeout handler to implement
a running abort.  The idea is fairly simple: you use the first fire of
eh_timed_out to trigger the abort (or LUN reset) while simultaneously
returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
hasn't returned, you escalate; otherwise you resend the command when the
abort returns.  This allows you to handle single command failures (up to
LUN reset) without stopping the host.  Obviously, if you have to
escalate to device reset, then you need to start the eh thread.
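
The escalation described above can be pictured as a small per-command
state machine.  What follows is only a user-space sketch under stated
assumptions, not kernel code: every name in it (sketch_cmd,
sketch_timed_out, sketch_recovery_done, the STEP_* and VERDICT_* values)
is hypothetical, VERDICT_RESET_TIMER merely stands in for returning
BLK_EH_RESET_TIMER, and actually issuing the abort or LUN reset is left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
enum sketch_step { STEP_NONE, STEP_ABORT, STEP_LUN_RESET, STEP_HOST_EH };

/* What the timeout hook tells its caller to do next. */
enum sketch_verdict {
    VERDICT_RESET_TIMER,  /* plays the role of BLK_EH_RESET_TIMER */
    VERDICT_START_EH      /* give up and hand the command to the eh thread */
};

struct sketch_cmd {
    enum sketch_step step;  /* recovery action currently outstanding */
    int resends;            /* times the command has been resent */
};

/* Called each time the command's timer fires. */
enum sketch_verdict sketch_timed_out(struct sketch_cmd *cmd)
{
    assert(cmd != NULL);
    switch (cmd->step) {
    case STEP_NONE:
        /* First fire: start an abort and keep the host running. */
        cmd->step = STEP_ABORT;
        return VERDICT_RESET_TIMER;
    case STEP_ABORT:
        /* Abort still outstanding on the second fire: escalate. */
        cmd->step = STEP_LUN_RESET;
        return VERDICT_RESET_TIMER;
    default:
        /* LUN reset did not return either: stop and run full eh. */
        cmd->step = STEP_HOST_EH;
        return VERDICT_START_EH;
    }
}

/* Called when the outstanding abort or LUN reset completes.
 * Returns true if the command was resent. */
bool sketch_recovery_done(struct sketch_cmd *cmd)
{
    assert(cmd != NULL);
    if (cmd->step != STEP_ABORT && cmd->step != STEP_LUN_RESET)
        return false;   /* nothing outstanding, or eh already owns it */
    cmd->step = STEP_NONE;
    cmd->resends++;
    return true;
}
```

In this sketch a command whose abort completes before the next timer
fire goes back to STEP_NONE and is resent; only a command that outlasts
both the abort and the LUN reset reaches VERDICT_START_EH, which is the
point where the host would have to be stopped.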

James
