On 9/17/20 11:00 PM, Ewan D. Milne wrote: > On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote: >> This is patch to fix the race occurs between bus detach and alua_rtpg. >> >> It fluses the all pending workqueue in bus detach handler, so it can avoid >> race between alua_bus_detach and alua_rtpg. >> >> Here is call trace where race got detected. >> >> multipathd call stack: >> [exception RIP: native_queued_spin_lock_slowpath+100] >> --- <NMI exception stack> --- >> native_queued_spin_lock_slowpath at ffffffff89307f54 >> queued_spin_lock_slowpath at ffffffff89307c18 >> _raw_spin_lock_irq at ffffffff89bd797b >> alua_bus_detach at ffffffff8984dcc8 >> scsi_dh_release_device at ffffffff8984b6f2 >> scsi_device_dev_release_usercontext at ffffffff89846edf >> execute_in_process_context at ffffffff892c3e60 >> scsi_device_dev_release at ffffffff8984637c >> device_release at ffffffff89800fbc >> kobject_cleanup at ffffffff89bb1196 >> kobject_put at ffffffff89bb12ea >> put_device at ffffffff89801283 >> scsi_device_put at ffffffff89838d5b >> scsi_disk_put at ffffffffc051f650 [sd_mod] >> sd_release at ffffffffc051f8a2 [sd_mod] >> __blkdev_put at ffffffff8952c79e >> blkdev_put at ffffffff8952c80c >> blkdev_close at ffffffff8952c8b5 >> __fput at ffffffff894e55e6 >> ____fput at ffffffff894e57ee >> task_work_run at ffffffff892c94dc >> exit_to_usermode_loop at ffffffff89204b12 >> do_syscall_64 at ffffffff892044da >> entry_SYSCALL_64_after_hwframe at ffffffff89c001b8 >> >> kworker: >> [exception RIP: alua_rtpg+2003] >> account_entity_dequeue at ffffffff892e42c1 >> alua_rtpg_work at ffffffff8984f097 >> process_one_work at ffffffff892c4c29 >> worker_thread at ffffffff892c5a4f >> kthread at ffffffff892cb135 >> ret_from_fork at ffffffff89c00354 >> >> Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx> >> --- >> drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c >> index f32da0c..024a752 100644 >> --- a/drivers/scsi/device_handler/scsi_dh_alua.c >> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c >> @@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev) >> struct alua_dh_data *h = sdev->handler_data; >> struct alua_port_group *pg; >> >> + sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME); >> + flush_workqueue(kaluad_wq); >> + >> spin_lock(&h->pg_lock); >> pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock)); >> rcu_assign_pointer(h->pg, NULL); > > I'm not sure this is the best solution. The current code > references h->sdev when the dh_list is traversed. So it needs > to remain valid. Fixing it by flushing the workqueue to avoid > the list traversal code running leaves open the possibility that > future code alterations may expose this problem again. > > -Ewan > > I see your point, but as we are in detach handler and this code path only execute when device is being detached. So, before detaching, flush work-queue will take care of any current code references h->sdev where dh_list is being traversed. IMO, I do not think it would create any problem for future code alterations. Or may be I am missing something over here, what could be possible scenario for that? --- Jitendra