Mike Christie wrote: > On 10/14/2010 01:45 AM, Hannes Reinecke wrote: >> Mike Christie wrote: >>> On 10/13/2010 06:42 AM, Hannes Reinecke wrote: >>>> >>>> fc_block_scsi_eh() might return with status FAST_IO_FAIL >>>> indicating I/O has been terminated due to fast_io_fail timeout. >>>> In this case the rport is still blocked, so any error recovery >>>> will be failing on this port. >>>> Hence each driver needs to check if the return value from >>>> fc_block_scsi_eh() is something other than SUCCESS, in which >>>> case it should just return with that status. >>>> And we need to update the error handler to deal with a >>>> status of FAST_IO_FAIL, too. >>>> And fc_block_scsi_eh() should return SUCCESS on success, >>>> not 0. Otherwise the calling routine will become confused >>>> when reusing that value. >>>> >>> >>> Is this patch supposed to fix the problem I described or is there more >>> patches to follow or do you think I am being too paranoid? It seems the >>> patch alone is going to make the problem worse in the drivers you are >>> speeding up failure in. In the drivers now checking fc_block_scsi_eh and >>> returning when fast io fail is returned then the scsi eh >>> scsi_eh_flush_done_q's scsi_finish_command/scsi_queue_insert processing >>> is going to get done faster and more likely to conflict with the >>> termiante_rport_io callback processing, right? >> >> No, this patch does _not_ fix the race you've described, it just >> fixes the problem of fc_block_scsi_eh() returning FAST_IO_FAIL, >> which screws the error handler in drivers not checking the return value. >> So it's partially covered by your patchset. >> >> However, in your patchset you're missing two issues: >> - The current implementation of fc_block_scsi_eh() return '0' on >> success. This screws over drivers which re-uses this value; >> check eg. lpfc_abort_handler(). So we should be returning >> 'SUCCESS' here. > > Yeah, my patchset is actually backwards. I converted the other drivers > to check for 0 (forgot to fix lpfc though). I think your patch patch > where we return SUCCESS is better than what I was doing. > > >> - scsi_error needs to be fixed up to handle FAST_IO_FAIL. >> Any of the functions called from scsi_abort_eh_cmnd() can return >> FAST_IO_FAIL, in which case escalating to the next function >> becomes pointless. > > There is a problem with fnic (it is also a bug in my patches). If fnic's > terminate_rport_io fails to abort the cmds, it is relying on the scsi eh > callouts to clean things up. So if fc_scsi_block_eh returns non-success > then we cannot return immediately from the callout. The drivers scsi eh > callout may still need to clean up the command internally. > Maybe it's better to change fnic to modify the FAST_IO_FAIL return value from fc_block_scsi_eh() and return FAILED here. > iSCSI/qla4xxx also needs to be changed to return SUCCESS instead of > zero. Luckily there is only one driver using it so far. > Yes, that would be preferable. >> >> I guess it would make sense to merge these two patchsets, either >> having two patchsets (one for the FAST_IO_FAIL changes and one for >> the rport race) or indeed merge them both into one. >> I'm okay with either of it. >> > > My patches for the race do not work though. When you replied to my > thread I thought you mean you were going to fix that too :) > > I have been thinking of how to fix the race. I will make my patches over > yours since your return SUCCESS fix is better. Ok, I'll be sending an updated patch. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@xxxxxxx +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html