Re: [PATCH] Check return value of fc_block_scsi_eh()

Mike Christie <michaelc@xxxxxxxxxxx> · Thu, 14 Oct 2010 14:02:47 -0500

On 10/14/2010 01:45 AM, Hannes Reinecke wrote:
Mike Christie wrote:
On 10/13/2010 06:42 AM, Hannes Reinecke wrote:

fc_block_scsi_eh() might return with status FAST_IO_FAIL
indicating I/O has been terminated due to fast_io_fail timeout.
In this case the rport is still blocked, so any error recovery
will be failing on this port.
Hence each driver needs to check if the return value from
fc_block_scsi_eh() is something other than SUCCESS, in which
case it should just return with that status.
And we need to update the error handler to deal with a
status of FAST_IO_FAIL, too.
And fc_block_scsi_eh() should return SUCCESS on success,
not 0. Otherwise the calling routine will become confused
when reusing that value.

Is this patch supposed to fix the problem I described or is there more
patches to follow or do you think I am being too paranoid? It seems the
patch alone is going to make the problem worse in the drivers you are
speeding up failure in. In the drivers now checking fc_block_scsi_eh and
returning when fast io fail is returned then the scsi eh
scsi_eh_flush_done_q's scsi_finish_command/scsi_queue_insert processing
is going to get done faster and more likely to conflict with the
termiante_rport_io callback processing, right?

No, this patch does _not_ fix the race you've described, it just
fixes the problem of fc_block_scsi_eh() returning FAST_IO_FAIL,
which screws the error handler in drivers not checking the return value.
So it's partially covered by your patchset.

However, in your patchset you're missing two issues:
- The current implementation of fc_block_scsi_eh() return '0' on
   success. This screws over drivers which re-uses this value;
   check eg. lpfc_abort_handler(). So we should be returning
   'SUCCESS' here.

Yeah, my patchset is actually backwards. I converted the other drivers 
to check for 0 (forgot to fix lpfc though). I think your patch patch 
where we return SUCCESS is better than what I was doing.

- scsi_error needs to be fixed up to handle FAST_IO_FAIL.
   Any of the functions called from scsi_abort_eh_cmnd() can return
   FAST_IO_FAIL, in which case escalating to the next function
   becomes pointless.

There is a problem with fnic (it is also a bug in my patches). If fnic's 
terminate_rport_io fails to abort the cmds, it is relying on the scsi eh 
callouts to clean things up. So if fc_scsi_block_eh returns non-success 
then we cannot return immediately from the callout. The drivers scsi eh 
callout may still need to clean up the command internally.

iSCSI/qla4xxx also needs to be changed to return SUCCESS instead of 
zero. Luckily there is only one driver using it so far.

I guess it would make sense to merge these two patchsets, either
having two patchsets (one for the FAST_IO_FAIL changes and one for
the rport race) or indeed merge them both into one.
I'm okay with either of it.

My patches for the race do not work though. When you replied to my 
thread I thought you mean you were going to fix that too :)

I have been thinking of how to fix the race. I will make my patches over 
yours since your return SUCCESS fix is better.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html