Re: fc_remote_port_delete and returning SCSI commands from LLD

Mike Christie <michaelc@xxxxxxxxxxx> · Tue, 27 Oct 2009 16:53:50 -0500

Christof Schmitt wrote:
On Wed, Oct 21, 2009 at 01:11:15PM -0500, Mike Christie wrote:
Christof Schmitt wrote:
If the remote_port status is not BLOCKED, this will trigger the SCSI
midlayer error handling which cannot do much during the interruption
to the hardware and will mark the SCSI devices 'offline'. In order to
prevent this, the rule would be: First call fc_remote_port_delete to
set the remote port (or in the case of an HBA interruption all remote
ports) to BLOCKED, and only after this step call scsi_done to pass the
SCSI commands back to the upper layers.
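The ordering rule above can be sketched as a small userspace model. This is not kernel code: every model_* name is hypothetical. model_rport_delete() only stands in for fc_remote_port_delete() and model_complete() for scsi_done(). The sketch assumes that deleting the rport moves it to BLOCKED before any command is handed back.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; none of these types are the kernel's. */
enum model_port_state { MODEL_ONLINE, MODEL_BLOCKED };

struct model_rport {
    enum model_port_state state;
};

struct model_cmd {
    int completed;               /* set once the command is handed back */
    int completed_while_blocked; /* port state observed at completion */
};

/* Stands in for fc_remote_port_delete(): the port goes to BLOCKED. */
static void model_rport_delete(struct model_rport *rport)
{
    assert(rport != NULL);
    rport->state = MODEL_BLOCKED;
}

/* Stands in for scsi_done(): records what the upper layers would see. */
static void model_complete(const struct model_rport *rport,
                           struct model_cmd *cmd)
{
    cmd->completed = 1;
    cmd->completed_while_blocked = (rport->state == MODEL_BLOCKED);
}

/*
 * Block the port first, then hand back all outstanding commands.
 * Returns the number of commands completed while the port was NOT
 * blocked, i.e. the ones that could end up in error handling.
 */
static int model_handle_link_loss(struct model_rport *rport,
                                  struct model_cmd *cmds, size_t n)
{
    size_t i;
    int unsafe = 0;

    model_rport_delete(rport);              /* step 1: block */
    for (i = 0; i < n; i++) {               /* step 2: complete */
        model_complete(rport, &cmds[i]);
        if (!cmds[i].completed_while_blocked)
            unsafe++;
    }
    return unsafe;
}
```

With this ordering the return value is always 0; swapping the two steps in the model would make every command count as unsafe.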

One other note when doing this.

For problems where you are deleting the rport, it is best to use  
something like DID_TRANSPORT_DISRUPTED to fail the cmd if you are  
failing it right away.

"something like DID_TRANSPORT_DISRUPTED" would be any error code that
goes through "maybe_retry" in scsi_decide_disposition? I guess moving
to DID_TRANSPORT_DISRUPTED is nice for consistency, but DID_ERROR
triggers the same code paths as far as I can see.

It could be a little different. See scsi_noretry_cmd. If you used 
DID_ERROR and something set the driver failfast bit then it would be 
fast failed.
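A rough userspace sketch of that difference follows. It is a simplified model, not the body of scsi_noretry_cmd: the enum, the flag and model_noretry() are hypothetical names, and the real function looks at more host bytes and at the command status than shown here. The only assumption carried over is the one stated above: DID_ERROR is fast failed when the driver failfast bit is set, while DID_TRANSPORT_DISRUPTED is not.

```c
#include <assert.h>

/* Hypothetical stand-ins for the host byte codes discussed here. */
enum model_host_byte {
    MODEL_DID_OK,
    MODEL_DID_ERROR,
    MODEL_DID_SOFT_ERROR,
    MODEL_DID_TRANSPORT_DISRUPTED
};

/* Hypothetical stand-in for the driver failfast request flag. */
#define MODEL_FAILFAST_DRIVER 0x1u

/*
 * Returns 1 if the command should be failed without a retry,
 * 0 if the normal retry logic may run.
 */
static int model_noretry(enum model_host_byte host, unsigned int flags)
{
    /* Only the one modelled flag bit is expected here. */
    assert((flags & ~MODEL_FAILFAST_DRIVER) == 0);

    switch (host) {
    case MODEL_DID_ERROR:
    case MODEL_DID_SOFT_ERROR:
        /* Driver-type errors honour the driver failfast bit. */
        return (flags & MODEL_FAILFAST_DRIVER) != 0;
    default:
        /* Transport disruption is not fast failed by this check. */
        return 0;
    }
}
```

So in the model, model_noretry(MODEL_DID_ERROR, MODEL_FAILFAST_DRIVER) gives up immediately, while the same flags with MODEL_DID_TRANSPORT_DISRUPTED still allow a retry.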

If drivers block the rport and then fail commands immediately with
DID_TRANSPORT_DISRUPTED, the commands will not actually be failed to
the block/mpath layer until the fast io fail timeout has fired. This
prevents very short problems from triggering the multipath path
offlining code.

Just to get the complete picture: Blocking the rport and then
returning DID_TRANSPORT_DISRUPTED will retry the command to the LLD
which then first calls fc_remote_port_chkready.
fc_remote_port_chkready will then keep the command between LLD and
SCSI midlayer until the rport state changes or the fast_fail fires.
Is this the complete picture or did I miss something?

I think that is it.
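As a minimal sketch of that picture, here is a userspace model of the check a driver makes at the top of queuecommand. The model_* names and result codes are hypothetical; model_chkready() only mirrors what is described in this thread for fc_remote_port_chkready (retry while BLOCKED, DID_TRANSPORT_FAILFAST once the fast io fail timer has fired) and leaves out details such as the port role checks.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; these are not the kernel's definitions. */
enum model_port_state { MODEL_ONLINE, MODEL_BLOCKED, MODEL_GONE };

enum model_result {
    MODEL_OK,                 /* port usable, send to hardware */
    MODEL_IMM_RETRY,          /* command is retried, stays queued */
    MODEL_TRANSPORT_FAILFAST, /* failed up to block/mpath layer */
    MODEL_NO_CONNECT          /* port is gone */
};

struct model_rport {
    enum model_port_state state;
    int fast_fail_timed_out;  /* set when fast io fail timer fired */
};

/* Stands in for fc_remote_port_chkready(). */
static enum model_result model_chkready(const struct model_rport *rport)
{
    assert(rport != NULL);

    switch (rport->state) {
    case MODEL_ONLINE:
        return MODEL_OK;
    case MODEL_BLOCKED:
        return rport->fast_fail_timed_out ? MODEL_TRANSPORT_FAILFAST
                                          : MODEL_IMM_RETRY;
    default:
        return MODEL_NO_CONNECT;
    }
}

/*
 * Stands in for the start of a driver's queuecommand.
 * Returns 1 if the command would be sent to the hardware, 0 if it is
 * completed right away with the code stored in *result.
 */
static int model_queuecommand(const struct model_rport *rport,
                              enum model_result *result)
{
    *result = model_chkready(rport);
    return *result == MODEL_OK;
}
```

While the port is BLOCKED and the timer has not fired, the model keeps answering MODEL_IMM_RETRY, which is the "kept between LLD and midlayer" state; after the timer fires the same call yields MODEL_TRANSPORT_FAILFAST.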

If your driver deletes the rport but does not fail the cmd immediately
(so it can recover within the command, or for some other reason, like
the fw just works that way), then when the fast io fail timer fires and
the terminate_rport_io callback is run you could actually use any error
code. At that point, when an IO is sent to queuecommand, the driver
will call fc_remote_port_chkready and the IO will be failed immediately
with DID_TRANSPORT_FAILFAST.

And the rport state is still BLOCKED, so at this point commands failed
in the upper layers with blk_abort_request will not end up in the SCSI
error recovery which cannot do much...

Thanks for the help, I am starting to get the complete picture...

Christof
