Re: [PATCH] scsi: alua: fix the race between alua_bus_detach and alua_rtpg

jitendra.khasdev@xxxxxxxxxx · Tue, 29 Sep 2020 23:14:58 +0530

On 9/24/20 4:09 PM, Hannes Reinecke wrote:
> On 9/23/20 7:50 PM, jitendra.khasdev@xxxxxxxxxx wrote:
>>
>>
>> On 9/23/20 1:47 PM, Hannes Reinecke wrote:
>>> On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:
>>>>
>>>>
>>>> On 9/17/20 11:00 PM, Ewan D. Milne wrote:
>>>>> On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
>>>>>> This is patch to fix the race occurs between bus detach and alua_rtpg.
>>>>>>
>>>>>> It fluses the all pending workqueue in bus detach handler, so it can avoid
>>>>>> race between alua_bus_detach and alua_rtpg.
>>>>>>
>>>>>> Here is call trace where race got detected.
>>>>>>
>>>>>> multipathd call stack:
>>>>>> [exception RIP: native_queued_spin_lock_slowpath+100]
>>>>>> --- <NMI exception stack> ---
>>>>>> native_queued_spin_lock_slowpath at ffffffff89307f54
>>>>>> queued_spin_lock_slowpath at ffffffff89307c18
>>>>>> _raw_spin_lock_irq at ffffffff89bd797b
>>>>>> alua_bus_detach at ffffffff8984dcc8
>>>>>> scsi_dh_release_device at ffffffff8984b6f2
>>>>>> scsi_device_dev_release_usercontext at ffffffff89846edf
>>>>>> execute_in_process_context at ffffffff892c3e60
>>>>>> scsi_device_dev_release at ffffffff8984637c
>>>>>> device_release at ffffffff89800fbc
>>>>>> kobject_cleanup at ffffffff89bb1196
>>>>>> kobject_put at ffffffff89bb12ea
>>>>>> put_device at ffffffff89801283
>>>>>> scsi_device_put at ffffffff89838d5b
>>>>>> scsi_disk_put at ffffffffc051f650 [sd_mod]
>>>>>> sd_release at ffffffffc051f8a2 [sd_mod]
>>>>>> __blkdev_put at ffffffff8952c79e
>>>>>> blkdev_put at ffffffff8952c80c
>>>>>> blkdev_close at ffffffff8952c8b5
>>>>>> __fput at ffffffff894e55e6
>>>>>> ____fput at ffffffff894e57ee
>>>>>> task_work_run at ffffffff892c94dc
>>>>>> exit_to_usermode_loop at ffffffff89204b12
>>>>>> do_syscall_64 at ffffffff892044da
>>>>>> entry_SYSCALL_64_after_hwframe at ffffffff89c001b8
>>>>>>
>>>>>> kworker:
>>>>>> [exception RIP: alua_rtpg+2003]
>>>>>> account_entity_dequeue at ffffffff892e42c1
>>>>>> alua_rtpg_work at ffffffff8984f097
>>>>>> process_one_work at ffffffff892c4c29
>>>>>> worker_thread at ffffffff892c5a4f
>>>>>> kthread at ffffffff892cb135
>>>>>> ret_from_fork at ffffffff89c00354
>>>>>>
>>>>>> Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
>>>>>> ---
>>>>>>    drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
>>>>>>    1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>> index f32da0c..024a752 100644
>>>>>> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>> @@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>>>>>        struct alua_dh_data *h = sdev->handler_data;
>>>>>>        struct alua_port_group *pg;
>>>>>>    +    sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
>>>>>> +    flush_workqueue(kaluad_wq);
>>>>>> +
>>>>>>        spin_lock(&h->pg_lock);
>>>>>>        pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>>>>>        rcu_assign_pointer(h->pg, NULL);
>>>>>
>>>>> I'm not sure this is the best solution.  The current code
>>>>> references h->sdev when the dh_list is traversed.  So it needs
>>>>> to remain valid.  Fixing it by flushing the workqueue to avoid
>>>>> the list traversal code running leaves open the possibility that
>>>>> future code alterations may expose this problem again.
>>>>>
>>>>> -Ewan
>>>>>
>>>>>
>>>>
>>>> I see your point, but as we are in detach handler and this code path
>>>> only execute when device is being detached. So, before detaching, flush
>>>> work-queue will take care of any current code references h->sdev where
>>>> dh_list is being traversed.
>>>>
>>> Flushing the workqueue is a bit of an overkill, seeing that we know exactly which workqueue element we're waiting for.
>>>
>>>> IMO, I do not think it would create any problem for future code
>>>> alterations. Or may be I am missing something over here, what could
>>>> be possible scenario for that?
>>>>
>>> Problem is more that I'd like to understand where exactly the race condition is. Can you figure out which spinlock is triggering in your stack trace?
>>>
>>> Cheers,
>>>
>>> Hannes
>>
>> Hannes,
>>
>> Race is between "alua_bus_detach" and "alua_rtpg_work".
>>
>> Whenever we perform fail-over or turn off the switch, the path goes down, which eventually triggers
>> blkdev_put -> .. -> scsi_device_dev_release -> .. ->  alua_bus_detach meanwhile another thread of alua_rtpg_work also running in parallel. Both threads are using sdev.
>>
>> In alua_bus_detach, we are setting null to sdev. From above call trace (multipathd) we can see alua_bus_deatch ran first and set sdev to null. It keeps its execution continue and it does not have any problem.
>>
>> 1138 /*
>> 1139  * alua_bus_detach - Detach device handler
>> 1140  * @sdev: device to be detached from
>> 1141  */
>> 1142 static void alua_bus_detach(struct scsi_device *sdev)
>> 1143 {
>> 1144         struct alua_dh_data *h = sdev->handler_data;
>> 1145         struct alua_port_group *pg;
>> 1146
>> 1147         spin_lock(&h->pg_lock);
>> 1148         pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>> 1149         rcu_assign_pointer(h->pg, NULL);
>> *1150*         h->sdev = NULL;  << Looks detach handler won the race and set sdev to null
>> 1151         spin_unlock(&h->pg_lock);
>> 1152         if (pg) {
>> 1153                 spin_lock_irq(&pg->lock); <<< from the call trace we can see that we just acquired the lock and got NMI
>> exception because we encountered a BUG_ON from different thread.
>> 1154                 list_del_rcu(&h->node);
>>
>>
>> Meanwhile alua_rtpg try to check for BUG_ON(!h->sdev);
>>
>> alua_rtpg_work -> alua_rtpg
>> ----
>>   505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
>>   506 {
>>   .
>>   .
>>   .
>>   659                                         list_for_each_entry_rcu(h,
>>   660                                                 &tmp_pg->dh_list, node) {
>>   661                                                 /* h->sdev should always be valid */
>>   *662*                                                 BUG_ON(!h->sdev); <<<< 2nd call trace caused the panic due to this bug on.
>>   663                                                 h->sdev->access_state = desc[0];
>>   664                                         }
>>   665                                         rcu_read_unlock();
>>   666                                 }
>> ----
>>
> Ah, yes.
> 
> We would need to take 'h->lock' here before checking 'h->sdev'.
> Alternatively, we should be able to fix it by not setting h->sdev to NULL, and issuing rcu_synchronize() before issuing kfree(h):
> 
> @@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
>         spin_lock(&h->pg_lock);
>         pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>         rcu_assign_pointer(h->pg, NULL);
> -       h->sdev = NULL;
>         spin_unlock(&h->pg_lock);
>         if (pg) {
>                 spin_lock_irq(&pg->lock);
> @@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
>                 kref_put(&pg->kref, release_port_group);
>         }
>         sdev->handler_data = NULL;
> +       rcu_synchronize();
>         kfree(h);
>  }
> 
> The 'rcu_synchronize()' will ensure that any concurrent thread has left the rcu-critical section (ie the loop mentioned above), and the issue will be avoided.
> Additionally, we could replace the BUG_ON() with
> 
> if (!h->sdev)
>     continue;
> 
> and the problem should be solved.
> 
> Cheers,
> 
> Hannes

Thanks Hannes for patch, I am currently testing it.

---
Jitendra