Re: Error handling on FC devices

Hannes Reinecke <hare@xxxxxxx> · Mon, 03 Dec 2012 08:15:10 +0100

On 11/30/2012 05:54 PM, Mike Christie wrote:
On 11/30/2012 05:44 AM, Hannes Reinecke wrote:
On 11/29/2012 05:02 PM, James Smart wrote:
Always possible, but... Our f/w works at the FCP level and below,
which means it doesn't know/do SCSI commands: e.g. it doesn't know
what the cdb within the FCP CMD frame is, or anything about SCSI
device classes and state, etc. And it shouldn't be required to do so.
Anytime this has been there in the past, it's been problematic.

If we want to do this, we should add it to the midlayer/transport.

D'accord. Transport layer looks like a good fit.

What we should be doing is hooking up 'bus_reset' to be equivalent to
REMOVE I_T NEXUS (SAS is already doing this).

Do you mean the scsi eh bus reset callout? If so, doesn't that work on
multiple targets, whereas REMOVE I_T NEXUS will only operate on one at
a time? I think it would be cleaner to add a new callout that works like
the target reset one, where the scsi-ml loops over the targets for the
drivers.

Well, looking at the QLogic and Emulex drivers, both emulate a bus
reset by looping over each target and invoking a target reset there.
I somewhat fail to see the rationale behind it, other than emulating
the bus reset behaviour on SPI.
Given that the original target reset already failed (otherwise we
wouldn't be doing a bus reset), I doubt a _second_ target reset
will lead to a different result.
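
To illustrate the point, here is a rough userspace sketch of that
emulation. Everything in it (struct sim_rport, sim_target_reset(),
emulated_bus_reset()) is a hypothetical stand-in and not taken from
any driver, and it assumes a target reset that failed once fails again
for the same reason:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a remote port; not a kernel structure. */
struct sim_rport {
    bool reset_ok;      /* assumption: outcome of a target reset is fixed */
    int target_resets;  /* how many target resets have been attempted */
};

/* Hypothetical target reset: counts the attempt, returns the fixed outcome. */
static bool sim_target_reset(struct sim_rport *rport)
{
    rport->target_resets++;
    return rport->reset_ok;
}

/*
 * Emulated "bus reset": loop over every target and issue a target
 * reset. Returns the number of targets whose reset failed.
 */
int emulated_bus_reset(struct sim_rport *ports, size_t n)
{
    int failed = 0;

    assert(ports != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!sim_target_reset(&ports[i]))
            failed++;
    }
    return failed;
}
```

Under that assumption, a port that got us into bus reset by failing its
target reset is simply reset a second time and fails again, while the
healthy ports are reset for no reason.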

So invoking REMOVE I_T NEXUS here can only improve matters :-)

I'm all for renaming bus_reset, though :-)

In our case a REMOVE I_T NEXUS would be roughly equivalent to
scsi_remote_port_delete(); only we should start aborting
outstanding I/O directly and not wait for fast_fail_tmo
to kick in.

To abort IO, will you be calling the driver's terminate_rport_io or
dev_loss_tmo_callbk? If so, I just wanted to warn you that I noticed
that some drivers will only initiate the aborting/cleanup of IO in
there. So if you call those callouts and expect that, when they have
finished, scsi-ml can free the scsi command and pass the request back
up, I think we could hit some races with memory issues.

Yeah, I know.
What I had in mind was to invoke terminate_rport_io() and then wait
for a certain time until either all outstanding commands have been
processed (i.e. starget->busy drops to zero) or the port state changed.
I'm not quite sure how long I should be waiting, but
dev_loss_tmo will be a good upper limit here.
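
Roughly, the wait could look like the userspace sketch below. This is
not the patch and not kernel code: struct sim_target, the tick
callbacks and wait_for_drain() are hypothetical stand-ins, 'busy'
models starget->busy, and 'max_ticks' models an upper limit derived
from dev_loss_tmo:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port states; not the real FC transport states. */
enum sim_port_state { SIM_PORT_ONLINE, SIM_PORT_GONE };

/* Hypothetical model of a target after terminate_rport_io() was called. */
struct sim_target {
    int busy;                   /* outstanding commands (models starget->busy) */
    enum sim_port_state state;  /* current remote port state */
};

enum wait_result { WAIT_DRAINED, WAIT_STATE_CHANGED, WAIT_TIMED_OUT };

/* One simulated time step; stands in for sleeping between polls. */
typedef void (*tick_fn)(struct sim_target *t);

/* Example ticks: one command completes, the port goes away, nothing happens. */
void tick_complete_one(struct sim_target *t) { if (t->busy > 0) t->busy--; }
void tick_port_lost(struct sim_target *t)    { t->state = SIM_PORT_GONE; }
void tick_stall(struct sim_target *t)        { (void)t; }

/*
 * Poll until all commands are back, the port state changes, or
 * max_ticks steps have elapsed, whichever comes first.
 */
enum wait_result wait_for_drain(struct sim_target *t, int max_ticks, tick_fn tick)
{
    enum sim_port_state initial;

    assert(t != NULL && tick != NULL && max_ticks >= 0);
    initial = t->state;

    for (int i = 0; ; i++) {
        if (t->busy == 0)
            return WAIT_DRAINED;
        if (t->state != initial)
            return WAIT_STATE_CHANGED;
        if (i == max_ticks)
            return WAIT_TIMED_OUT;
        tick(t);
    }
}
```

Only the WAIT_DRAINED case would make it safe to hand the commands back
up; the other two results mean the driver may still own them, which is
exactly the race Mike is warning about.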

As said, I'll be posting a patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)