Question about scsi error handling and LLD

scameron@xxxxxxxxxxxxxxxxxx · Tue, 19 Aug 2014 10:30:51 -0500

Is it permitted for a LLD to call scsi_done() for a command which
has been or is attempting to be aborted via the LLD abort error handler
function?

I am looking at the scsi mid layer code, and I'm a bit confused.

There seem to be a few ways the LLD abort handler function can
get called.

One path:

scsi_times_out() calls scsi_abort_command() which queues
cmd->abort_work in 10ms.  This will cause scmd_eh_abort_handler()
to be called, which can call scsi_try_to_abort_cmd() which
calls the LLD abort error handler function.  If the LLD abort
error handler function returns SUCCESS, then scmd_eh_abort_handler()
may requeue the command via scsi_queue_insert() re-using the same
scsi_cmnd pointer.  Or, it may call scsi_finish_cmnd() on that
scsi_cmnd.

Both of those actions, requeueing the same scsi command, or calling
scsi_finish_cmnd(), seem incompatible with the LLD also having
called scsi_done() on the command.

I think, scsi_done eventually frees the scsi_cmnd...

scsi_done() -> blk_complete_request -> __blk_complete_request ->
	-> raise_softirq_irqoff	-> ??? -> scsi_softirq_done() ->
	-> scsi_finish_cmnd() -> scsi_io_completion -> scsi_end_request()
	-> scsi_mq_uninit_cmd() ... etc.

It seems like either calling scsi_finish_cmnd() too many times (once via
LLD calling scsi_done, and once via scmd_eh_abort_handler calling
scsi_finish_cmnd()), or calling scsi_done() via LLD concurrently with
re-using that same scsi command (via scmd_eh_abort_handler calling
scsi_queue_insert()) would both be bad things.  Correct?

So, without completely understanding it, it seems like if the LLD
abort handler is going to return SUCCESS, it must be sure that it
hasn't already called scsi_done() on the command.  Or to put it
another way, the LLD must make sure there is not a race between
the abort handler and the interrupt handler, and the instant
the LLD abort handler is called for a command, calling scsi_done()
for that command is no longer an option for the LLD.  But, I do
not see how to synchronize that.

I hope I'm missing something here, because I am having a hard time
seeing how to eliminate the race between the LLD abort handler and
command completion in the LLD, esp. since now (due to my hardware's
specific weirdness) I may in the interrupt handler requeue the command
internally in a workqueue within the LLD for submission to HW a 2nd
time before completion back to the mid layer under some circumstances
(e.g. fast path failed, try slower but more robust RAID stack path.)

But, then there is this comment attached to scsi_times_out() which
makes me feel better:

 * Notes:
 *     We do not need to lock this.  There is the potential for a race
 *     only in that the normal completion handling might run, but if the
 *     normal completion function determines that the timer has already
 *     fired, then it mustn't do anything.
 */

So, perhaps that is the answer I'm looking for.  If the comment is
correct and my understanding of it is correct, then the LLD *is* free
to call scsi_done() even though it races with the LLD abort handler?

I'm not finding the code that implements what that comment claims though.
Is it in the block layer code?

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html