Re: Debugging scsi abort handling ?

Hannes Reinecke <hare@xxxxxxx> · Fri, 29 Aug 2014 12:30:26 +0200

On 08/29/2014 12:14 PM, Finn Thain wrote:

On Fri, 29 Aug 2014, Hannes Reinecke wrote:

On 08/29/2014 06:39 AM, Finn Thain wrote:

On Thu, 28 Aug 2014, Hannes Reinecke wrote:

What might happen, though, is that the command is already dead and gone
by the time you're calling ->scsi_done() (if you call it after
eh_abort). So there might not _be_ a command upon which you can call
->scsi_done() to start with.

Hence any LLDD needs to clear up any internal references after a call
to eh_XXX to ensure it doesn't call ->scsi_done() on an invalid
command.

So even if the LLDD returns 'FAILED' upon a call to eh_XXX it
_still_ needs to clear up the internal reference.

This is a question that has been bothering me too. If the host's
eh_abort_cmd() method returns FAILED, it seems the mid-layer is liable
to re-issue the same command to the LLD (?)

No.
FAILED for any eh_abort_cmd() means that the TMF hasn't been sent.

Makes sense, though it appears to contradict this advice about returning
SUCCESS in some situations:
http://marc.info/?l=linux-scsi&m=140923498632496&w=2

Well, if the LLDD detects an invalid command (ie if it cannot find 
any internal command matching the midlayer command) that's an 
automatic success, obviously.

So we should rephrase things to:

- The eh_XXX callback shall return 'SUCCESS' if the respective
  TMF (or equivalent) could be initiated or if the matching command
  reference has already been completed by the LLDD. Otherwise
  the eh_XXX callback shall return 'FAILED'.
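
To make the rephrased rule concrete, here is a minimal userspace sketch of the return-value decision. It is not taken from any driver: model_eh_result() and its two parameters are hypothetical names invented for illustration, and only the SUCCESS/FAILED values mirror include/scsi/scsi.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric values as in include/scsi/scsi.h. */
#define SUCCESS 0x2002
#define FAILED  0x2003

/*
 * Hypothetical helper: the value an eh_XXX callback would return
 * under the rule above.
 *
 * cmd_known_to_lldd: the LLDD still holds an internal command that
 *                    matches the midlayer command.
 * tmf_initiated:     the TMF (or equivalent) could be initiated.
 */
static int model_eh_result(bool cmd_known_to_lldd, bool tmf_initiated)
{
	/* No matching internal command: already completed, so SUCCESS. */
	if (!cmd_known_to_lldd)
		return SUCCESS;

	return tmf_initiated ? SUCCESS : FAILED;
}
```

With this reading, FAILED only ever means "the command was still known and the TMF could not be sent".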

The command will only ever be re-issued once EH completes.

...

But indeed, 'FAILED' is not very meaningful here, leaving the midlayer
with no information about what happened to the command.

Personally I would like to enforce this meaning on the eh_XXX callbacks:
- upon each eh_XXX callback the LLDD clears any internal references
   to the command / command scope (ie eh_abort_cmd clears the
   references to the command, eh_lun_reset clears all internal
   references to commands to this ITL nexus etc.)
   This happens irrespective of the return code.
- The eh_XXX callback shall return 'FAILED' if the respective
   TMF (or equivalent) could not be initiated.
- The eh_XXX callback shall return 'SUCCESS' if the respective
   TMF (or equivalent) could be initiated.
- After each eh_XXX callback control for this command / command
   scope is transferred back to the midlayer; the LLDD shall not
   assume the associated command structures to remain valid after
   that point.
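
The four points above can be sketched as a small userspace model. Everything here is an assumption made for illustration: struct model_lldd, its slot table, the tmf_can_be_sent switch and the model_* functions are hypothetical and do not correspond to any real driver or to the midlayer API. The done_calls counter stands in for a call to ->scsi_done().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as in include/scsi/scsi.h. */
#define SUCCESS 0x2002
#define FAILED  0x2003

#define MODEL_MAX_CMDS 4

/* Hypothetical stand-in for a midlayer command. */
struct model_cmd {
	int done_calls;		/* counts simulated ->scsi_done() calls */
};

/* Hypothetical LLDD state: one internal reference per outstanding command. */
struct model_lldd {
	struct model_cmd *slot[MODEL_MAX_CMDS];
	bool tmf_can_be_sent;	/* models whether a TMF can be initiated */
};

/* Store an internal reference to a newly queued command. */
static bool model_queue(struct model_lldd *h, struct model_cmd *cmd)
{
	for (int i = 0; i < MODEL_MAX_CMDS; i++) {
		if (!h->slot[i]) {
			h->slot[i] = cmd;
			return true;
		}
	}
	return false;
}

/* Abort callback: drops the reference whatever the return code is. */
static int model_eh_abort(struct model_lldd *h, struct model_cmd *cmd)
{
	for (int i = 0; i < MODEL_MAX_CMDS; i++) {
		if (h->slot[i] != cmd)
			continue;
		h->slot[i] = NULL;	/* cleared even if we return FAILED */
		return h->tmf_can_be_sent ? SUCCESS : FAILED;
	}
	return SUCCESS;			/* no matching internal command */
}

/* Late hardware completion for slot i; true if the command was completed. */
static bool model_hw_complete(struct model_lldd *h, int i)
{
	assert(i >= 0 && i < MODEL_MAX_CMDS);

	struct model_cmd *cmd = h->slot[i];

	if (!cmd)
		return false;	/* reference gone: the command must not be touched */
	h->slot[i] = NULL;
	cmd->done_calls++;	/* stands in for ->scsi_done(cmd) */
	return true;
}
```

In this model a completion that arrives after a FAILED abort finds an empty slot and never touches the command, which is the whole point of clearing the reference irrespective of the return code.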

Perhaps that last constraint should be relaxed to "After the final EH
callback (whether implemented or unimplemented by the host), command /
command scope is transferred back to the midlayer..."

No, that's wrong.

By the time any eh_XXX callbacks are triggered control _is_ already
back at the midlayer. I.e. the command timeout has triggered and the block
layer has already set the REQ_ATOM_COMPLETE flag, short-circuiting any
attempts to call ->scsi_done().
So with the callbacks the midlayer actually informs the LLDD about a 
certain fact; there is nothing the LLDD can do to change ownership 
at that point.

(Correction: During the call of any eh_XXX callbacks control _is_ 
back at the LLDD, otherwise the callbacks would be pointless. It's
just that the LLDD shouldn't assume the command is valid _after_
any of the eh_XXX callbacks has terminated.)
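
The short-circuit described above can be modelled in userspace as a test-and-set on a per-request flag: whichever path sets it first (timeout or normal completion) owns the request, and the other path backs off. This is only a sketch under that assumption; struct model_request and the model_* functions are hypothetical, and a C11 atomic_flag stands in for the block layer's per-request completion bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a block layer request. */
struct model_request {
	atomic_flag complete;	/* models the per-request "completed" bit */
	int done_calls;		/* counts completions that went through */
};

static void model_request_init(struct model_request *rq)
{
	atomic_flag_clear(&rq->complete);
	rq->done_calls = 0;
}

/* True if the caller is the first to claim the request. */
static bool model_mark_complete(struct model_request *rq)
{
	return !atomic_flag_test_and_set(&rq->complete);
}

/* Timeout path: claims the request before EH is started. */
static bool model_timeout(struct model_request *rq)
{
	return model_mark_complete(rq);
}

/* Completion path: a no-op if the timeout already claimed the request. */
static bool model_done(struct model_request *rq)
{
	if (!model_mark_complete(rq))
		return false;	/* short-circuited */
	rq->done_calls++;
	return true;
}
```

Once model_timeout() has won, a later model_done() changes nothing, so nothing the LLDD does during EH can take ownership back.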

A more severe TMF is probably mandatory (e.g. bus reset) but if the driver
author later added a milder one (e.g. bus device reset), your rule would
mean that the existing handler would then operate under new constraints,
which might cause surprises.

Well, _if_ we were to adopt this rule we would obviously have to audit
existing LLDDs to check whether the rule is followed, and tweak them if not.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)