On 9/23/20 1:47 PM, Hannes Reinecke wrote:
> On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:
>>
>>
>> On 9/17/20 11:00 PM, Ewan D. Milne wrote:
>>> On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
>>>> This is a patch to fix the race that occurs between bus detach and
>>>> alua_rtpg.
>>>>
>>>> It flushes all pending work on the workqueue in the bus detach handler,
>>>> so it avoids the race between alua_bus_detach and alua_rtpg.
>>>>
>>>> Here is the call trace where the race was detected.
>>>>
>>>> multipathd call stack:
>>>> [exception RIP: native_queued_spin_lock_slowpath+100]
>>>> --- <NMI exception stack> ---
>>>> native_queued_spin_lock_slowpath at ffffffff89307f54
>>>> queued_spin_lock_slowpath at ffffffff89307c18
>>>> _raw_spin_lock_irq at ffffffff89bd797b
>>>> alua_bus_detach at ffffffff8984dcc8
>>>> scsi_dh_release_device at ffffffff8984b6f2
>>>> scsi_device_dev_release_usercontext at ffffffff89846edf
>>>> execute_in_process_context at ffffffff892c3e60
>>>> scsi_device_dev_release at ffffffff8984637c
>>>> device_release at ffffffff89800fbc
>>>> kobject_cleanup at ffffffff89bb1196
>>>> kobject_put at ffffffff89bb12ea
>>>> put_device at ffffffff89801283
>>>> scsi_device_put at ffffffff89838d5b
>>>> scsi_disk_put at ffffffffc051f650 [sd_mod]
>>>> sd_release at ffffffffc051f8a2 [sd_mod]
>>>> __blkdev_put at ffffffff8952c79e
>>>> blkdev_put at ffffffff8952c80c
>>>> blkdev_close at ffffffff8952c8b5
>>>> __fput at ffffffff894e55e6
>>>> ____fput at ffffffff894e57ee
>>>> task_work_run at ffffffff892c94dc
>>>> exit_to_usermode_loop at ffffffff89204b12
>>>> do_syscall_64 at ffffffff892044da
>>>> entry_SYSCALL_64_after_hwframe at ffffffff89c001b8
>>>>
>>>> kworker:
>>>> [exception RIP: alua_rtpg+2003]
>>>> account_entity_dequeue at ffffffff892e42c1
>>>> alua_rtpg_work at ffffffff8984f097
>>>> process_one_work at ffffffff892c4c29
>>>> worker_thread at ffffffff892c5a4f
>>>> kthread at ffffffff892cb135
>>>> ret_from_fork at ffffffff89c00354
>>>>
>>>> Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
>>>> ---
>>>>  drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
>>>>  1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> index f32da0c..024a752 100644
>>>> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>> @@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>>>  	struct alua_dh_data *h = sdev->handler_data;
>>>>  	struct alua_port_group *pg;
>>>>  
>>>> +	sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
>>>> +	flush_workqueue(kaluad_wq);
>>>> +
>>>>  	spin_lock(&h->pg_lock);
>>>>  	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>>>  	rcu_assign_pointer(h->pg, NULL);
>>>
>>> I'm not sure this is the best solution.  The current code
>>> references h->sdev when the dh_list is traversed.  So it needs
>>> to remain valid.  Fixing it by flushing the workqueue to avoid
>>> the list traversal code running leaves open the possibility that
>>> future code alterations may expose this problem again.
>>>
>>> -Ewan
>>>
>>>
>>
>> I see your point, but we are in the detach handler and this code path
>> only executes when the device is being detached. So, before detaching,
>> flushing the workqueue will take care of any code that currently
>> references h->sdev while the dh_list is being traversed.
>>
> Flushing the workqueue is a bit of an overkill, seeing that we know
> exactly which workqueue element we're waiting for.
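
To make that idea concrete, here is a rough, untested sketch of what waiting
for just that one element might look like. It assumes the current
scsi_dh_alua.c layout, i.e. that the work item in question is pg->rtpg_work
(the delayed work that alua_rtpg_queue() puts on kaluad_wq), and it would go
where the posted patch calls flush_workqueue(kaluad_wq), near the top of
alua_bus_detach() where h and pg are already declared:

----
	/*
	 * Untested sketch: take a reference on the port group and wait only
	 * for its rtpg_work instead of draining the whole kaluad_wq.
	 * Whether this is sufficient is a separate question, since h may
	 * also be reachable from another port group's dh_list.
	 */
	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg)
		kref_get(&pg->kref);
	rcu_read_unlock();

	if (pg) {
		flush_delayed_work(&pg->rtpg_work);
		kref_put(&pg->kref, release_port_group);
	}
----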
>
>> IMO, I do not think it would create any problem for future code
>> alterations. Or maybe I am missing something here; what could be a
>> possible scenario for that?
>>
> Problem is more that I'd like to understand where exactly the race
> condition is.
> Can you figure out which spinlock is triggering in your stack trace?
>
> Cheers,
>
> Hannes

Hannes,

The race is between "alua_bus_detach" and "alua_rtpg_work".

Whenever we perform a fail-over or turn off the switch, the path goes down,
which eventually triggers

  blkdev_put -> .. -> scsi_device_dev_release -> .. -> alua_bus_detach

while another thread running alua_rtpg_work executes in parallel. Both
threads use the same sdev. In alua_bus_detach, we set h->sdev to NULL.

From the call trace above (multipathd) we can see that alua_bus_detach won
the race and set h->sdev to NULL. It then continued its own execution
without any problem.

----
1138 /*
1139  * alua_bus_detach - Detach device handler
1140  * @sdev: device to be detached from
1141  */
1142 static void alua_bus_detach(struct scsi_device *sdev)
1143 {
1144 	struct alua_dh_data *h = sdev->handler_data;
1145 	struct alua_port_group *pg;
1146
1147 	spin_lock(&h->pg_lock);
1148 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
1149 	rcu_assign_pointer(h->pg, NULL);
*1150*	h->sdev = NULL;           << the detach handler won the race and set
                                     sdev to NULL
1151 	spin_unlock(&h->pg_lock);
1152 	if (pg) {
1153 		spin_lock_irq(&pg->lock);  <<< from the call trace we can see that
                                               we had just acquired this lock and
                                               then got the NMI exception, because
                                               the other thread hit the BUG_ON
                                               below
1154 		list_del_rcu(&h->node);
----

Meanwhile, alua_rtpg_work -> alua_rtpg runs into BUG_ON(!h->sdev):

----
 505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
 506 {
.
.
.
 659 			list_for_each_entry_rcu(h,
 660 					&tmp_pg->dh_list, node) {
 661 				/* h->sdev should always be valid */
*662*				BUG_ON(!h->sdev);  <<<< the 2nd call trace panicked
                                                        on this BUG_ON
 663 				h->sdev->access_state = desc[0];
 664 			}
 665 			rcu_read_unlock();
 666 		}
----

So it looks like alua_rtpg_work invoked alua_rtpg, and alua_rtpg was still in
its early execution, before reaching line 662. Meanwhile the alua_bus_detach
thread came in and executed line 1150, which set h->sdev to NULL. When
alua_rtpg then reaches line 662, it trips the BUG_ON, which results in the
panic.

---
Jitendra
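
As an aside, if the requirement Ewan pointed out is that h->sdev stays valid
for as long as h->node is reachable through a dh_list, one untested sketch
(not a posted patch, just to illustrate the ordering) would be to unlink the
node from the list before clearing the pointer:

----
static void alua_bus_detach(struct scsi_device *sdev)
{
	struct alua_dh_data *h = sdev->handler_data;
	struct alua_port_group *pg;

	spin_lock(&h->pg_lock);
	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
	rcu_assign_pointer(h->pg, NULL);
	spin_unlock(&h->pg_lock);

	if (pg) {
		spin_lock_irq(&pg->lock);
		list_del_rcu(&h->node);		/* node gone from dh_list first */
		spin_unlock_irq(&pg->lock);
		kref_put(&pg->kref, release_port_group);
	}

	spin_lock(&h->pg_lock);
	h->sdev = NULL;		/* clear the back-pointer only afterwards */
	spin_unlock(&h->pg_lock);

	sdev->handler_data = NULL;
	kfree(h);
}
----

On its own this does not synchronize with a traversal that is already inside
list_for_each_entry_rcu(); that part still relies on RCU, or on waiting for
the RTPG work as in the posted patch.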