Hi Hannes,

On 10/12/20 2:22 PM, jitendra.khasdev@xxxxxxxxxx wrote:
> Hi Hannes,
>
> On 9/24/20 4:09 PM, Hannes Reinecke wrote:
>> On 9/23/20 7:50 PM, jitendra.khasdev@xxxxxxxxxx wrote:
>>>
>>>
>>> On 9/23/20 1:47 PM, Hannes Reinecke wrote:
>>>> On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:
>>>>>
>>>>>
>>>>> On 9/17/20 11:00 PM, Ewan D. Milne wrote:
>>>>>> On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
>>>>>>> This patch fixes a race between bus detach and alua_rtpg.
>>>>>>>
>>>>>>> It flushes all pending work on the workqueue in the bus detach
>>>>>>> handler, so the race between alua_bus_detach and alua_rtpg is
>>>>>>> avoided.
>>>>>>>
>>>>>>> Here are the call traces where the race was detected.
>>>>>>>
>>>>>>> multipathd call stack:
>>>>>>> [exception RIP: native_queued_spin_lock_slowpath+100]
>>>>>>> --- <NMI exception stack> ---
>>>>>>> native_queued_spin_lock_slowpath at ffffffff89307f54
>>>>>>> queued_spin_lock_slowpath at ffffffff89307c18
>>>>>>> _raw_spin_lock_irq at ffffffff89bd797b
>>>>>>> alua_bus_detach at ffffffff8984dcc8
>>>>>>> scsi_dh_release_device at ffffffff8984b6f2
>>>>>>> scsi_device_dev_release_usercontext at ffffffff89846edf
>>>>>>> execute_in_process_context at ffffffff892c3e60
>>>>>>> scsi_device_dev_release at ffffffff8984637c
>>>>>>> device_release at ffffffff89800fbc
>>>>>>> kobject_cleanup at ffffffff89bb1196
>>>>>>> kobject_put at ffffffff89bb12ea
>>>>>>> put_device at ffffffff89801283
>>>>>>> scsi_device_put at ffffffff89838d5b
>>>>>>> scsi_disk_put at ffffffffc051f650 [sd_mod]
>>>>>>> sd_release at ffffffffc051f8a2 [sd_mod]
>>>>>>> __blkdev_put at ffffffff8952c79e
>>>>>>> blkdev_put at ffffffff8952c80c
>>>>>>> blkdev_close at ffffffff8952c8b5
>>>>>>> __fput at ffffffff894e55e6
>>>>>>> ____fput at ffffffff894e57ee
>>>>>>> task_work_run at ffffffff892c94dc
>>>>>>> exit_to_usermode_loop at ffffffff89204b12
>>>>>>> do_syscall_64 at ffffffff892044da
>>>>>>> entry_SYSCALL_64_after_hwframe at ffffffff89c001b8
>>>>>>>
>>>>>>> kworker:
>>>>>>> [exception RIP: alua_rtpg+2003]
>>>>>>> account_entity_dequeue at ffffffff892e42c1
>>>>>>> alua_rtpg_work at ffffffff8984f097
>>>>>>> process_one_work at ffffffff892c4c29
>>>>>>> worker_thread at ffffffff892c5a4f
>>>>>>> kthread at ffffffff892cb135
>>>>>>> ret_from_fork at ffffffff89c00354
>>>>>>>
>>>>>>> Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
>>>>>>>  1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>>> index f32da0c..024a752 100644
>>>>>>> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>>> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
>>>>>>> @@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>>>>>>  	struct alua_dh_data *h = sdev->handler_data;
>>>>>>>  	struct alua_port_group *pg;
>>>>>>> +	sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
>>>>>>> +	flush_workqueue(kaluad_wq);
>>>>>>> +
>>>>>>>  	spin_lock(&h->pg_lock);
>>>>>>>  	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>>>>>>  	rcu_assign_pointer(h->pg, NULL);
>>>>>>
>>>>>> I'm not sure this is the best solution. The current code
>>>>>> references h->sdev when the dh_list is traversed, so it needs
>>>>>> to remain valid. Fixing it by flushing the workqueue, so that the
>>>>>> list traversal code does not run, leaves open the possibility that
>>>>>> future code alterations may expose this problem again.
>>>>>>
>>>>>> -Ewan
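As an illustration of the alternative raised further down in this
thread (waiting on the specific work item instead of flushing the whole
queue), a targeted synchronization might look roughly like the sketch
below. This is only a sketch: it assumes pg->rtpg_work is the delayed
work that scsi_dh_alua.c queues on kaluad_wq, it may sleep and
therefore cannot run under a spinlock, and real code would also have to
keep the work from being re-queued afterwards.

	/* Cancel and wait for only this port group's pending RTPG
	 * work, instead of draining the whole shared workqueue.
	 */
	cancel_delayed_work_sync(&pg->rtpg_work);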
>>>>>
>>>>> I see your point, but we are in the detach handler, and this code
>>>>> path only executes when the device is being detached. So, before
>>>>> detaching, flushing the workqueue will take care of any in-flight
>>>>> code that references h->sdev while the dh_list is being traversed.
>>>>>
>>>> Flushing the workqueue is a bit of an overkill, seeing that we know
>>>> exactly which workqueue element we're waiting for.
>>>>
>>>>> IMO, I do not think it would create any problem for future code
>>>>> alterations. Or maybe I am missing something here; what would be a
>>>>> possible scenario for that?
>>>>>
>>>> The problem is rather that I'd like to understand where exactly the
>>>> race condition is. Can you figure out which spinlock is triggering
>>>> in your stack trace?
>>>>
>>>> Cheers,
>>>>
>>>> Hannes
>>>
>>> Hannes,
>>>
>>> The race is between "alua_bus_detach" and "alua_rtpg_work".
>>>
>>> Whenever we perform a fail-over or turn off the switch, the path
>>> goes down, which eventually triggers
>>> blkdev_put -> .. -> scsi_device_dev_release -> .. -> alua_bus_detach,
>>> while another thread is running alua_rtpg_work in parallel. Both
>>> threads use the same sdev.
>>>
>>> In alua_bus_detach, we set h->sdev to NULL. From the multipathd call
>>> trace above we can see that alua_bus_detach ran first and set sdev
>>> to NULL. It then continued executing and had no problem of its own.
>>>
>>> 1138 /*
>>> 1139  * alua_bus_detach - Detach device handler
>>> 1140  * @sdev: device to be detached from
>>> 1141  */
>>> 1142 static void alua_bus_detach(struct scsi_device *sdev)
>>> 1143 {
>>> 1144         struct alua_dh_data *h = sdev->handler_data;
>>> 1145         struct alua_port_group *pg;
>>> 1146
>>> 1147         spin_lock(&h->pg_lock);
>>> 1148         pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>> 1149         rcu_assign_pointer(h->pg, NULL);
>>> *1150*       h->sdev = NULL;    << the detach handler won the race and set sdev to NULL
>>> 1151         spin_unlock(&h->pg_lock);
>>> 1152         if (pg) {
>>> 1153                 spin_lock_irq(&pg->lock);    <<< the call trace shows we had just acquired
>>>                                                      this lock when the NMI exception hit,
>>>                                                      because the other thread tripped a BUG_ON
>>> 1154                 list_del_rcu(&h->node);
>>>
>>> Meanwhile, alua_rtpg runs into BUG_ON(!h->sdev):
>>>
>>> alua_rtpg_work -> alua_rtpg
>>> ----
>>> 505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
>>> 506 {
>>> .
>>> .
>>> .
>>> 659                         list_for_each_entry_rcu(h,
>>> 660                                                 &tmp_pg->dh_list, node) {
>>> 661                                 /* h->sdev should always be valid */
>>> *662*                               BUG_ON(!h->sdev);    <<<< the second call trace panicked on this BUG_ON
>>> 663                                 h->sdev->access_state = desc[0];
>>> 664                         }
>>> 665                         rcu_read_unlock();
>>> 666                 }
>>> ----
>>>
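To make the race easier to see, here is the interleaving from the two
excerpts above as a timeline (illustrative only; the entry points match
the call traces at the top of the thread):

	CPU0: alua_bus_detach()            CPU1: alua_rtpg_work() -> alua_rtpg()
	--------------------------------   -------------------------------------
	spin_lock(&h->pg_lock);
	rcu_assign_pointer(h->pg, NULL);   rcu_read_lock();
	h->sdev = NULL;                    list_for_each_entry_rcu(h,
	spin_unlock(&h->pg_lock);                  &tmp_pg->dh_list, node) {
	spin_lock_irq(&pg->lock);              /* h is not yet unlinked */
	                                       BUG_ON(!h->sdev); /* fires */
	list_del_rcu(&h->node);   /* too late */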
>> Ah, yes.
>>
>> We would need to take 'h->pg_lock' here before checking 'h->sdev'.
>> Alternatively, we should be able to fix it by not setting h->sdev to
>> NULL, and issuing synchronize_rcu() before issuing kfree(h):
>>
>> @@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>  	spin_lock(&h->pg_lock);
>>  	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>>  	rcu_assign_pointer(h->pg, NULL);
>> -	h->sdev = NULL;
>>  	spin_unlock(&h->pg_lock);
>>  	if (pg) {
>>  		spin_lock_irq(&pg->lock);
>> @@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
>>  		kref_put(&pg->kref, release_port_group);
>>  	}
>>  	sdev->handler_data = NULL;
>> +	synchronize_rcu();
>>  	kfree(h);
>>  }
>>
>> The synchronize_rcu() call will ensure that any concurrent thread has
>> left the RCU critical section (i.e. the loop mentioned above), and
>> the issue will be avoided.
>> Additionally, we could replace the BUG_ON() with
>>
>> 	if (!h->sdev)
>> 		continue;
>>
>> and the problem should be solved.
>>
>> Cheers,
>>
>> Hannes
>
>
> This patch works and avoids the crash during fail-over. It looks good
> to me in testing.
>
> ---
> Jitendra
>

Gentle reminder: I am wondering if we can proceed with integrating this
patch into mainline.

---
Jitendra
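P.S.: For reference, with both suggestions above applied, the traversal
in alua_rtpg() would read roughly as follows. This is a sketch based on
the excerpt earlier in the thread; on the detach side, the
synchronize_rcu() before kfree(h) guarantees that 'h' cannot be freed
while a reader is still inside this loop.

	rcu_read_lock();
	list_for_each_entry_rcu(h, &tmp_pg->dh_list, node) {
		/* Defensive: skip entries whose sdev is not (or no
		 * longer) set, instead of asserting with BUG_ON().
		 */
		if (!h->sdev)
			continue;
		h->sdev->access_state = desc[0];
	}
	rcu_read_unlock();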