On 9/23/20 1:47 PM, Hannes Reinecke wrote:
> On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:
>>
>>
>> On 9/17/20 11:00 PM, Ewan D. Milne wrote:
>>> On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
>>>> This is a patch to fix the race that occurs between bus detach and
>>>> alua_rtpg.
>>>>
>>>> It flushes all pending work on the workqueue in the bus detach handler,
>>>> so it avoids the race between alua_bus_detach and alua_rtpg.
>>>>
>>>> Here is the call trace where the race was detected.
>>>>
>>>> multipathd call stack:
>>>> [exception RIP: native_queued_spin_lock_slowpath+100]
>>>> --- <NMI exception stack> ---
>>>> native_queued_spin_lock_slowpath at ffffffff89307f54
>>>> queued_spin_lock_slowpath at ffffffff89307c18
>>>> _raw_spin_lock_irq at ffffffff89bd797b
>>>> alua_bus_detach at ffffffff8984dcc8
>>>> scsi_dh_release_device at ffffffff8984b6f2
>>>> scsi_device_dev_release_usercontext at ffffffff89846edf
>>>> execute_in_process_context at ffffffff892c3e60
>>>> scsi_device_dev_release at ffffffff8984637c
>>>> device_release at ffffffff89800fbc
>>>> kobject_cleanup at ffffffff89bb1196
>>>> kobject_put at ffffffff89bb12ea
>>>> put_device at ffffffff89801283
>>>> scsi_device_put at ffffffff89838d5b
>>>> scsi_disk_put at ffffffffc051f650 [sd_mod]
>>>> sd_release at ffffffffc051f8a2 [sd_mod]
>>>> __blkdev_put at ffffffff8952c79e
>>>> blkdev_put at ffffffff8952c80c
>>>> blkdev_close at ffffffff8952c8b5
>>>> __fput at ffffffff894e55e6
>>>> ____fput at ffffffff894e57ee
>>>> task_work_run at ffffffff892c94dc
>>>> exit_to_usermode_loop at ffffffff89204b12
>>>> do_syscall_64 at ffffffff892044da
>>>> entry_SYSCALL_64_after_hwframe at ffffffff89c001b8
>>>>
>>>> kworker:
>>>> [exception RIP: alua_rtpg+2003]
>>>> account_entity_dequeue at ffffffff892e42c1
>>>> alua_rtpg_work at ffffffff8984f097
>>>> process_one_work at ffffffff892c4c29
>>>> worker_thread at ffffffff892c5a4f
>>>> kthread at ffffffff892cb135
>>>> ret_from_fork at ffffffff89c00354
>>>>
>>>> Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
>>>> ---
>>>>  drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> index f32da0c..024a752 100644
>>>> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> @@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>>>  	struct alua_dh_data *h = sdev->handler_data;
>>>>  	struct alua_port_group *pg;
>>>>  
>>>> +	sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
>>>> +	flush_workqueue(kaluad_wq);
>>>> +
>>>>  	spin_lock(&h->pg_lock);
>>>>  	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>>>  	rcu_assign_pointer(h->pg, NULL);
>>>
>>> I'm not sure this is the best solution.  The current code
>>> references h->sdev when the dh_list is traversed.  So it needs
>>> to remain valid.  Fixing it by flushing the workqueue to avoid
>>> the list traversal code running leaves open the possibility that
>>> future code alterations may expose this problem again.
>>>
>>> -Ewan
>>>
>>>
>>
>> I see your point, but we are in the detach handler and this code path
>> only executes when the device is being detached. So, before detaching,
>> flushing the workqueue will take care of any code that currently
>> references h->sdev while the dh_list is being traversed.
>>
> Flushing the workqueue is a bit of an overkill, seeing that we know
> exactly which workqueue element we're waiting for.
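
To make that idea concrete, here is a rough, untested sketch of what waiting
for just that one element might look like. It assumes the current
scsi_dh_alua.c layout, i.e. that the work item in question is pg->rtpg_work
(the delayed work that alua_rtpg_queue() puts on kaluad_wq), and it would go
where the posted patch calls flush_workqueue(kaluad_wq), near the top of
alua_bus_detach() where h and pg are already declared:

----
	/*
	 * Untested sketch: take a reference on the port group and wait only
	 * for its rtpg_work instead of draining the whole kaluad_wq.
	 * Whether this is sufficient is a separate question, since h may
	 * also be reachable from another port group's dh_list.
	 */
	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg)
		kref_get(&pg->kref);
	rcu_read_unlock();

	if (pg) {
		flush_delayed_work(&pg->rtpg_work);
		kref_put(&pg->kref, release_port_group);
	}
----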
>
>> IMO, I do not think it would create any problem for future code
>> alterations. Or maybe I am missing something here; what could be a
>> possible scenario for that?
>>
> Problem is more that I'd like to understand where exactly the race
> condition is.
> Can you figure out which spinlock is triggering in your stack trace?
>
> Cheers,
>
> Hannes

Hannes,

The race is between "alua_bus_detach" and "alua_rtpg_work".

Whenever we perform a fail-over or turn off the switch, the path goes down,
which eventually triggers

  blkdev_put -> .. -> scsi_device_dev_release -> .. -> alua_bus_detach

while another thread running alua_rtpg_work executes in parallel. Both
threads use the same sdev. In alua_bus_detach, we set h->sdev to NULL.

From the call trace above (multipathd) we can see that alua_bus_detach won
the race and set h->sdev to NULL. It then continued its own execution
without any problem.

----
1138 /*
1139  * alua_bus_detach - Detach device handler
1140  * @sdev: device to be detached from
1141  */
1142 static void alua_bus_detach(struct scsi_device *sdev)
1143 {
1144 	struct alua_dh_data *h = sdev->handler_data;
1145 	struct alua_port_group *pg;
1146
1147 	spin_lock(&h->pg_lock);
1148 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
1149 	rcu_assign_pointer(h->pg, NULL);
*1150*	h->sdev = NULL;           << the detach handler won the race and set
                                     sdev to NULL
1151 	spin_unlock(&h->pg_lock);
1152 	if (pg) {
1153 		spin_lock_irq(&pg->lock);  <<< from the call trace we can see that
                                               we had just acquired this lock and
                                               then got the NMI exception, because
                                               the other thread hit the BUG_ON
                                               below
1154 		list_del_rcu(&h->node);
----

Meanwhile, alua_rtpg_work -> alua_rtpg runs into BUG_ON(!h->sdev):

----
 505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
 506 {
.
.
.
 659 			list_for_each_entry_rcu(h,
 660 					&tmp_pg->dh_list, node) {
 661 				/* h->sdev should always be valid */
*662*				BUG_ON(!h->sdev);  <<<< the 2nd call trace panicked
                                                        on this BUG_ON
 663 				h->sdev->access_state = desc[0];
 664 			}
 665 			rcu_read_unlock();
 666 		}
----

So it looks like alua_rtpg_work invoked alua_rtpg, and alua_rtpg was still in
its early execution, before reaching line 662. Meanwhile the alua_bus_detach
thread came in and executed line 1150, which set h->sdev to NULL. When
alua_rtpg then reaches line 662, it trips the BUG_ON, which results in the
panic.

---
Jitendra
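
As an aside, if the requirement Ewan pointed out is that h->sdev stays valid
for as long as h->node is reachable through a dh_list, one untested sketch
(not a posted patch, just to illustrate the ordering) would be to unlink the
node from the list before clearing the pointer:

----
static void alua_bus_detach(struct scsi_device *sdev)
{
	struct alua_dh_data *h = sdev->handler_data;
	struct alua_port_group *pg;

	spin_lock(&h->pg_lock);
	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
	rcu_assign_pointer(h->pg, NULL);
	spin_unlock(&h->pg_lock);

	if (pg) {
		spin_lock_irq(&pg->lock);
		list_del_rcu(&h->node);		/* node gone from dh_list first */
		spin_unlock_irq(&pg->lock);
		kref_put(&pg->kref, release_port_group);
	}

	spin_lock(&h->pg_lock);
	h->sdev = NULL;		/* clear the back-pointer only afterwards */
	spin_unlock(&h->pg_lock);

	sdev->handler_data = NULL;
	kfree(h);
}
----

On its own this does not synchronize with a traversal that is already inside
list_for_each_entry_rcu(); that part still relies on RCU, or on waiting for
the RTPG work as in the posted patch.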