On 7/12/22 5:19 PM, Bart Van Assche wrote: > From: Ming Lei <ming.lei@xxxxxxxxxx> > > Fix the race conditions between SCSI LLD kernel module unloading and SCSI > device and target removal by making sure that SCSI hosts are destroyed after > all associated target and device objects have been freed. > > Cc: Christoph Hellwig <hch@xxxxxx> > Cc: Ming Lei <ming.lei@xxxxxxxxxx> > Cc: Mike Christie <michael.christie@xxxxxxxxxx> > Cc: Hannes Reinecke <hare@xxxxxxx> > Cc: John Garry <john.garry@xxxxxxxxxx> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> > Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx> > [ bvanassche: Reworked Ming's patch and split it ] > --- > drivers/scsi/hosts.c | 8 ++++++++ > drivers/scsi/scsi_scan.c | 7 +++++++ > include/scsi/scsi_host.h | 3 +++ > 3 files changed, 18 insertions(+) > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c > index ef6c0e37acce..8fa98c8d0ee0 100644 > --- a/drivers/scsi/hosts.c > +++ b/drivers/scsi/hosts.c > @@ -190,6 +190,13 @@ void scsi_remove_host(struct Scsi_Host *shost) > transport_unregister_device(&shost->shost_gendev); > device_unregister(&shost->shost_dev); > device_del(&shost->shost_gendev); > + > + /* > + * After scsi_remove_host() has returned the scsi LLD module can be > + * unloaded and/or the host resources can be released. Hence wait until > + * the dependent SCSI targets and devices are gone before returning. > + */ > + wait_event(shost->targets_wq, atomic_read(&shost->target_count) == 0); > } If we only wait here we can still hit the race I described right? Is the issue where we might be misunderstanding each other that the target removal is slightly different from the host removal? For host removal we call scsi_forget_host with the scan_mutex already held. So when scsi_forget_host loops over the devices we know that there is no thread doing: sdev_store_delete -> scsi_remove_device -> __scsi_remove_device -> blk_cleanup_queue Since the sdev_store_delete call to scsi_remove_device call also grabs the scan_mutex, we can't call scsi_forget_host until sdev_store_delete -> scsi_remove_device has returned. For target removal,__scsi_remove_target doesn't take the scan_mutex when checking the device state. So, we have this race: 1. syfs deletion runs sdev_store_delete -> scsi_remove_device and that takes the scan_mutex. It then sets the state to SDEV_DEL. 2. fc/iscsi thread does __scsi_remove_target and it sees the device is in the SDEV_DEL state. It skips the device and then we return from __scsi_remove_target without having waited on that device's removal like is done in other cases. If the only issue we are concerned with is blk_cleanup_queue completing when we remove the host or target, then for the target race above I think we can just use the scan_mutex in __scsi_remove_target (that function then would use __scsi_remove_device). If the issue is that there would be other threads holding a ref to the scsi_device and they can call into the driver. and we want to make sure those refs are dropped when scsi_remove_target returns then we need to do what I described in the other thread.