Will this patch be integrated into scsi-misc or other tree at some point? Thanks, Mike James Smart wrote: > In a prior posting to linux-scsi on the fc transport and workq > deadlocks, we noted a second error that did not have a patch: > http://marc.theaimsgroup.com/?l=linux-scsi&m=114467847711383&w=2 > - There's a deadlock where scsi_remove_target() has to sit behind > scsi_scan_target() due to contention over the scan_lock(). > > Subsequently we posted a request for comments about the deadlock: > http://marc.theaimsgroup.com/?l=linux-scsi&m=114469358829500&w=2 > > This posting resolves the second error. Here's what we now understand, > and are implementing: > > If the lldd deletes the rport while a scan is active, the sdev's queue > is blocked which stops the issuing of commands associated with the scan. > At this point, the scan stalls, and does so with the shost->scan_mutex held. > If, at this point, if any scan or delete request is made on the host, it > will stall waiting for the scan_mutex. > > For the FC transport, we queue all delete work to a single workq. > So, things worked fine when competing with the scan, as long as the > target blocking the scan was the same target at the top of our delete > workq, as the delete workq routine always unblocked just prior to > requesting the delete. Unfortunately, if the top of our delete workq > was for a different target, we deadlock. Additionally, if the target > blocking scan returned, we were unblocking it in the scan workq routine, > which really won't execute until the existing stalled scan workq > completes (e.g. we're re-scheduling it while it is in the midst of its > execution). > > This patch moves the unblock out of the workq routines and moves it to > the context that is scheduling the work. This ensures that at some point, > we will unblock the target that is blocking scan. Please note, however, > that the deadlock condition may still occur while it waits for the > transport to timeout an unblock on a target. Worst case, this is bounded > by the transport dev_loss_tmo (default: 30 seconds). > > > Finally, Michael Reed deserves the credit for the bulk of this patch, > analysis, and it's testing. Thank you for your help. > > -- james s > > Note: this patch supercedes a previous preliminary post by Michael Reed > http://marc.theaimsgroup.com/?l=linux-scsi&m=114675967824701&w=2 > > Note: The request for comments statements about the gross-ness of the > scan_mutex still stand. > > > > Signed-off-by: Michael Reed <mdr@xxxxxxx> > Signed-off-by: James Smart <james.smart@xxxxxxxxxx> > > > diff -upNr a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c > --- a/drivers/scsi/scsi_transport_fc.c 2006-05-11 10:43:31.000000000 -0400 > +++ b/drivers/scsi/scsi_transport_fc.c 2006-05-11 12:04:36.000000000 -0400 > @@ -1284,7 +1284,9 @@ EXPORT_SYMBOL(fc_release_transport); > * @work: Work to queue for execution. > * > * Return value: > - * 0 on success / != 0 for error > + * 1 - work queued for execution > + * 0 - work is already queued > + * -EINVAL - work queue doesn't exist > **/ > static int > fc_queue_work(struct Scsi_Host *shost, struct work_struct *work) > @@ -1434,8 +1436,6 @@ fc_starget_delete(void *data) > struct Scsi_Host *shost = rport_to_shost(rport); > unsigned long flags; > > - scsi_target_unblock(&rport->dev); > - > spin_lock_irqsave(shost->host_lock, flags); > if (rport->flags & FC_RPORT_DEVLOSS_PENDING) { > spin_unlock_irqrestore(shost->host_lock, flags); > @@ -1707,6 +1707,8 @@ fc_remote_port_add(struct Scsi_Host *sho > > spin_unlock_irqrestore(shost->host_lock, flags); > > + scsi_target_unblock(&rport->dev); > + > return rport; > } > } > @@ -1762,9 +1764,10 @@ fc_remote_port_add(struct Scsi_Host *sho > /* initiate a scan of the target */ > rport->flags |= FC_RPORT_SCAN_PENDING; > scsi_queue_work(shost, &rport->scan_work); > - } > - > - spin_unlock_irqrestore(shost->host_lock, flags); > + spin_unlock_irqrestore(shost->host_lock, flags); > + scsi_target_unblock(&rport->dev); > + } else > + spin_unlock_irqrestore(shost->host_lock, flags); > > return rport; > } > @@ -1938,6 +1941,7 @@ fc_remote_port_rolechg(struct fc_rport > rport->flags |= FC_RPORT_SCAN_PENDING; > scsi_queue_work(shost, &rport->scan_work); > spin_unlock_irqrestore(shost->host_lock, flags); > + scsi_target_unblock(&rport->dev); > } > } > EXPORT_SYMBOL(fc_remote_port_rolechg); > @@ -1970,8 +1974,9 @@ fc_timeout_deleted_rport(void *data) > dev_printk(KERN_ERR, &rport->dev, > "blocked FC remote port time out: no longer" > " a FCP target, removing starget\n"); > - fc_queue_work(shost, &rport->stgt_delete_work); > spin_unlock_irqrestore(shost->host_lock, flags); > + scsi_target_unblock(&rport->dev); > + fc_queue_work(shost, &rport->stgt_delete_work); > return; > } > > @@ -2035,17 +2040,15 @@ fc_timeout_deleted_rport(void *data) > * went away and didn't come back - we'll remove > * all attached scsi devices. > */ > - fc_queue_work(shost, &rport->stgt_delete_work); > - > spin_unlock_irqrestore(shost->host_lock, flags); > + > + scsi_target_unblock(&rport->dev); > + fc_queue_work(shost, &rport->stgt_delete_work); > } > > /** > * fc_scsi_scan_rport - called to perform a scsi scan on a remote port. > * > - * Will unblock the target (in case it went away and has now come back), > - * then invoke a scan. > - * > * @data: remote port to be scanned. > **/ > static void > @@ -2057,7 +2060,6 @@ fc_scsi_scan_rport(void *data) > > if ((rport->port_state == FC_PORTSTATE_ONLINE) && > (rport->roles & FC_RPORT_ROLE_FCP_TARGET)) { > - scsi_target_unblock(&rport->dev); > scsi_scan_target(&rport->dev, rport->channel, > rport->scsi_target_id, SCAN_WILD_CARD, 1); > } > > > - > : send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > > - : send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html