On Wed, Jan 23, 2019 at 04:46:17PM -0800, Bart Van Assche wrote:
> Several SCSI low-level drivers need to suspend .queuecommand() calls while
> HBA or transport layer recovery happens. The iSCSI and SRP initiator drivers
> use scsi_target_block() to block new .queuecommand() calls while recovery
> happens. scsi_target_block() prevents that the block layer core triggers new
> .queuecommand() calls but does not prevent that the SCSI error handler calls
> .queuecommand(). SCSI LLD authors have the choice of either hoping that
> .queuecommand() calls from the SCSI error handler won't happen while transport
> layer recovery is in progress or to add code in the .queuecommand() function
> that detects from which context that call comes and to delay such
> .queuecommand() calls. In the SRP initiator driver that code looks as follows:
>
>     const bool in_scsi_eh = !in_interrupt() && current == shost->ehandler;
>
>     /*
>      * The SCSI EH thread is the only context from which srp_queuecommand()
>      * can get invoked for blocked devices (SDEV_BLOCK /
>      * SDEV_CREATED_BLOCK). Avoid racing with srp_reconnect_rport() by
>      * locking the rport mutex if invoked from inside the SCSI EH.
>      */
>     if (in_scsi_eh)
>         mutex_lock(&rport->mutex);
>
> In my opinion the SCSI core should make it easy for LLD authors to prevent that
> the error handler calls .queuecommand() while transport layer recovery is in
> progress. So considerable time ago I posted several patches that modify the SCSI
> error handler and that avoid that SCSI LLDs have to detect the context a
> .queuecommand() call comes from. None of these patches were accepted and no
> alternative approach was proposed. Hence the proposal to discuss this topic in
> person during LSF/MM.
>
> See also "[PATCH 1/2] RDMA/srp: Avoid calling mutex_lock() from inside
> scsi_queue_rq()" (https://www.spinics.net/lists/linux-rdma/msg73842.html).
>

Having SCSI EH run while transport recovery is running for the same
context is a bit of a pain in general.
I remember seeing situations like this with zFCP once or twice (about
two years ago), especially when SCSI EH tries to unblock commands on the
same context that is going through transport recovery at that moment.

For example, EH wants to send a TUR to an rport that we are currently
recovering. EH then fails, because we can't physically service that TUR
right then, and it escalates, possibly with bad timing, until it forces
us through adapter recovery, which then faults all other rports as well.

Having some more coordination here would be good.

--
With Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Systems & Technology Group / IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Matthias Hartmann / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294