Re: [PATCH 2/4] scsi: add transport host byte errors

James Smart <James.Smart@xxxxxxxxxx> · Thu, 15 Mar 2007 09:20:43 -0400

michaelc@xxxxxxxxxxx wrote:
fast_io_fail_tmo
iscsi: session recovery_tmo
fc: rport fast_io_fail_tmo

The difference is that when the timer fires, for iscsi we unblock the
queue and fail commands in the blocked queue. FC just fails IO running
in the driver/fw/hw. The IO in the blocked queue sits there until dev_loss_tmo.

True - FC contacts the LLDD to terminate i/o, and the LLDD has no notion of any
io that has yet to be sent to it via queuecommand(). Blocked i/o sits until
dev_loss_tmo, as that is when the sdev gets torn down. Perhaps a block layer call
should be created, that the FC transport can call, to terminate the blocked
queue.  Thoughts ?
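
To make the difference concrete, here is a rough userspace model, not kernel
code. Every name in it (cmd_state, fast_fail_timer_fired, flush_blocked_queue)
is made up for illustration; flush_blocked_queue() stands in for the block
layer call suggested above, which does not exist yet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: where a command sits when the timer fires. */
enum cmd_state {
    CMD_IN_FLIGHT,  /* already handed to the LLDD via queuecommand() */
    CMD_BLOCKED,    /* still held in the blocked queue */
    CMD_FAILED      /* completed back to the midlayer with an error */
};

enum transport { TRANSPORT_ISCSI, TRANSPORT_FC };

/*
 * Model the timer firing (recovery_tmo for iscsi, fast_io_fail_tmo for fc).
 * iscsi: the queue is unblocked and the blocked commands are failed.
 * fc:    only i/o already in the driver/fw/hw is terminated.
 * Returns the number of commands failed by this call.
 */
static size_t fast_fail_timer_fired(enum transport t,
                                    enum cmd_state *cmds, size_t n)
{
    size_t failed = 0;

    assert(cmds != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int hit = (t == TRANSPORT_ISCSI && cmds[i] == CMD_BLOCKED) ||
                  (t == TRANSPORT_FC && cmds[i] == CMD_IN_FLIGHT);
        if (hit) {
            cmds[i] = CMD_FAILED;
            failed++;
        }
    }
    return failed;
}

/*
 * Stand-in for the proposed (not yet existing) block layer call that the
 * FC transport could use to terminate what is left in the blocked queue.
 */
static size_t flush_blocked_queue(enum cmd_state *cmds, size_t n)
{
    size_t failed = 0;

    assert(cmds != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (cmds[i] == CMD_BLOCKED) {
            cmds[i] = CMD_FAILED;
            failed++;
        }
    }
    return failed;
}
```

In this model, an FC rport with one in-flight and two blocked commands fails
only the first when the timer fires; the other two stay put until something
like flush_blocked_queue() (or dev_loss_tmo) deals with them.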

dev_loss_tmo
iscsi: none yet (we are working on it :))
fc: dev_loss_tmo

Currently, if there is a transport problem the iscsi drivers will return
outstanding commands (commands being executed by the driver/fw/hw) with
DID_BUS_BUSY and block the session so no new commands can be queued.
Commands that are caught between the failure handling and blocking are
failed with DID_IMM_RETRY or one of the scsi ml queuecommand return values.
When the recovery_timeout fires, the iscsi drivers then fail IO with
DID_NO_CONNECT.
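
The iscsi sequence just described can be sketched as a small lookup. This is
only an illustration: the DID_* values are the ones from the kernel's
include/scsi/scsi.h, but the phase enum and both helper functions are
hypothetical names, and the race case is shown only with DID_IMM_RETRY (it may
instead surface as a scsi ml queuecommand return value).

```c
#include <assert.h>

/* Host byte values as defined in include/scsi/scsi.h. */
#define DID_NO_CONNECT  0x01
#define DID_BUS_BUSY    0x02
#define DID_IMM_RETRY   0x0c

/* Hypothetical names for the three situations described above. */
enum iscsi_phase {
    PHASE_OUTSTANDING_AT_FAILURE,  /* in driver/fw/hw when the problem hit */
    PHASE_RACE_BEFORE_BLOCK,       /* caught between failure and blocking */
    PHASE_RECOVERY_TMO_EXPIRED     /* recovery timeout has fired */
};

/* Map a phase to the host byte the iscsi drivers are described as using. */
static int iscsi_host_byte(enum iscsi_phase phase)
{
    switch (phase) {
    case PHASE_OUTSTANDING_AT_FAILURE:
        return DID_BUS_BUSY;
    case PHASE_RACE_BEFORE_BLOCK:
        return DID_IMM_RETRY;
    case PHASE_RECOVERY_TMO_EXPIRED:
        return DID_NO_CONNECT;
    }
    assert(0 && "unknown phase");
    return DID_NO_CONNECT;
}

/* The host byte lives in bits 16-23 of a SCSI command result. */
static int make_result(int host_byte)
{
    return host_byte << 16;
}
```

So a command failed after the recovery timeout would carry
make_result(DID_NO_CONNECT), i.e. 0x10000, in its result field.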

For fcp, some drivers will fail some outstanding IO (disk but possibly not
tape) with DID_BUS_BUSY or some other value that causes a retry and hits
the scsi_error.c failfast check, block the rport, and commands caught in the
race are failed with DID_IMM_RETRY. Other drivers will hold onto all IO
and wait for terminate_rport_io or dev_loss_tmo_callbk to be called.
In this case lpfc could return the IO with DID_ERROR.

Note: Variability in behavior has to be allowed as both implementations are
within the FC specification. Also, the "everything killed" scenario is a valid
worst case behavior that can always occur. The "it's not killed immediately"
scenario is an optimization towards best-case behavior (with better FC-MI-2
compliance).

Lpfc returns DID_ERROR as the io requests had been queued to the adapter,
may have gone out on the wire, and may have changed media. They were terminated
early based on the respective timeout. Thus, a BUSY status, which implies
no media change, is deemed inappropriate.  Based on the conversation, you
are implying that the layer above, which asked for the fastfail, may want to
distinguish between an io terminated due to the fastfail timeout vs an io
that failed due to a real error. Easy enough to do - we just need a new
return status.  And, I see, that's what the patch below does.
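
As a sketch of what the layer above could then do, assuming such a status
exists: DID_EXAMPLE_FASTFAIL and its value below are placeholders made up for
this example (they are not the name or value from the patch), as are the enum
and the function name. DID_BUS_BUSY and DID_ERROR are the existing
include/scsi/scsi.h values.

```c
#include <assert.h>

/* Existing host byte values from include/scsi/scsi.h. */
#define DID_BUS_BUSY  0x02
#define DID_ERROR     0x07

/* PLACEHOLDER: stands in for the new status; name and value are made up. */
#define DID_EXAMPLE_FASTFAIL  0x7f

/* Hypothetical classification an upper layer might want. */
enum completion_cause {
    CAUSE_BUSY_NO_MEDIA_CHANGE,  /* BUSY: implies media was not changed */
    CAUSE_FASTFAIL_TIMEOUT,      /* terminated early by the fastfail timer */
    CAUSE_REAL_ERROR,            /* failed for some other reason */
    CAUSE_OTHER
};

/* Pull the host byte (bits 16-23) out of a result and classify it. */
static enum completion_cause classify_completion(int result)
{
    int host_byte = (result >> 16) & 0xff;

    assert(result >= 0);

    switch (host_byte) {
    case DID_BUS_BUSY:
        return CAUSE_BUSY_NO_MEDIA_CHANGE;
    case DID_EXAMPLE_FASTFAIL:
        return CAUSE_FASTFAIL_TIMEOUT;
    case DID_ERROR:
        return CAUSE_REAL_ERROR;
    default:
        return CAUSE_OTHER;
    }
}
```

With only DID_ERROR available, the second and third cases are
indistinguishable to the caller, which is the gap a new status would close.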

So far, so good....

-- james s
