On Thu, Sep 24, 2020 at 12:39:48PM +0200, Hannes Reinecke wrote:
> On 9/23/20 7:50 PM, jitendra.khasdev@xxxxxxxxxx wrote:
> >
> >
> >On 9/23/20 1:47 PM, Hannes Reinecke wrote:
> >>On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:
> >>>
> >>>
> >>>On 9/17/20 11:00 PM, Ewan D. Milne wrote:
> >>>>On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
> >>>>>This is a patch to fix the race that occurs between bus detach and alua_rtpg.
> >>>>>
> >>>>>It flushes all pending work in the bus detach handler, so it can avoid the
> >>>>>race between alua_bus_detach and alua_rtpg.
> >>>>>
> >>>>>Here are the call traces where the race was detected.
> >>>>>
> >>>>>multipathd call stack:
> >>>>>[exception RIP: native_queued_spin_lock_slowpath+100]
> >>>>>--- <NMI exception stack> ---
> >>>>>native_queued_spin_lock_slowpath at ffffffff89307f54
> >>>>>queued_spin_lock_slowpath at ffffffff89307c18
> >>>>>_raw_spin_lock_irq at ffffffff89bd797b
> >>>>>alua_bus_detach at ffffffff8984dcc8
> >>>>>scsi_dh_release_device at ffffffff8984b6f2
> >>>>>scsi_device_dev_release_usercontext at ffffffff89846edf
> >>>>>execute_in_process_context at ffffffff892c3e60
> >>>>>scsi_device_dev_release at ffffffff8984637c
> >>>>>device_release at ffffffff89800fbc
> >>>>>kobject_cleanup at ffffffff89bb1196
> >>>>>kobject_put at ffffffff89bb12ea
> >>>>>put_device at ffffffff89801283
> >>>>>scsi_device_put at ffffffff89838d5b
> >>>>>scsi_disk_put at ffffffffc051f650 [sd_mod]
> >>>>>sd_release at ffffffffc051f8a2 [sd_mod]
> >>>>>__blkdev_put at ffffffff8952c79e
> >>>>>blkdev_put at ffffffff8952c80c
> >>>>>blkdev_close at ffffffff8952c8b5
> >>>>>__fput at ffffffff894e55e6
> >>>>>____fput at ffffffff894e57ee
> >>>>>task_work_run at ffffffff892c94dc
> >>>>>exit_to_usermode_loop at ffffffff89204b12
> >>>>>do_syscall_64 at ffffffff892044da
> >>>>>entry_SYSCALL_64_after_hwframe at ffffffff89c001b8
> >>>>>
> >>>>>kworker:
> >>>>>[exception RIP: alua_rtpg+2003]
> >>>>>account_entity_dequeue at ffffffff892e42c1
> >>>>>alua_rtpg_work at ffffffff8984f097
> >>>>>process_one_work at ffffffff892c4c29
> >>>>>worker_thread at ffffffff892c5a4f
> >>>>>kthread at ffffffff892cb135
> >>>>>ret_from_fork at ffffffff89c00354
> >>>>>
> >>>>>Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
> >>>>>---
> >>>>> drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
> >>>>> 1 file changed, 3 insertions(+)
> >>>>>
> >>>>>diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
> >>>>>index f32da0c..024a752 100644
> >>>>>--- a/drivers/scsi/device_handler/scsi_dh_alua.c
> >>>>>+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> >>>>>@@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
> >>>>> 	struct alua_dh_data *h = sdev->handler_data;
> >>>>> 	struct alua_port_group *pg;
> >>>>>
> >>>>>+	sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
> >>>>>+	flush_workqueue(kaluad_wq);
> >>>>>+
> >>>>> 	spin_lock(&h->pg_lock);
> >>>>> 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
> >>>>> 	rcu_assign_pointer(h->pg, NULL);
> >>>>
> >>>>I'm not sure this is the best solution.  The current code
> >>>>references h->sdev when the dh_list is traversed.  So it needs
> >>>>to remain valid.  Fixing it by flushing the workqueue to avoid
> >>>>the list traversal code running leaves open the possibility that
> >>>>future code alterations may expose this problem again.
> >>>>
> >>>>-Ewan
> >>>>
> >>>>
> >>>
> >>>I see your point, but as we are in the detach handler, this code path
> >>>only executes when the device is being detached. So, before detaching,
> >>>flushing the workqueue will take care of any current code that
> >>>references h->sdev while the dh_list is being traversed.
> >>>
> >>Flushing the workqueue is a bit of an overkill, seeing that we know
> >>exactly which workqueue element we're waiting for.
> >>
> >>>IMO, I do not think it would create any problem for future code
> >>>alterations. Or maybe I am missing something here; what would a
> >>>possible scenario for that be?
> >>>
> >>The problem is more that I'd like to understand where exactly the race
> >>condition is. Can you figure out which spinlock is triggering in your
> >>stack trace?
> >>
> >>Cheers,
> >>
> >>Hannes
> >
> >Hannes,
> >
> >The race is between "alua_bus_detach" and "alua_rtpg_work".
> >
> >Whenever we perform a fail-over or turn off the switch, the path goes
> >down, which eventually triggers blkdev_put -> .. ->
> >scsi_device_dev_release -> .. -> alua_bus_detach, while another thread
> >running alua_rtpg_work executes in parallel. Both threads use sdev.
> >
> >In alua_bus_detach, we set h->sdev to NULL. From the call trace above
> >(multipathd) we can see that alua_bus_detach won the race and set sdev
> >to NULL. It then continues its execution without any problem of its own.
> >
> >1138 /*
> >1139  * alua_bus_detach - Detach device handler
> >1140  * @sdev: device to be detached from
> >1141  */
> >1142 static void alua_bus_detach(struct scsi_device *sdev)
> >1143 {
> >1144         struct alua_dh_data *h = sdev->handler_data;
> >1145         struct alua_port_group *pg;
> >1146
> >1147         spin_lock(&h->pg_lock);
> >1148         pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
> >1149         rcu_assign_pointer(h->pg, NULL);
> >*1150*       h->sdev = NULL;         << the detach handler won the race and set sdev to NULL
> >1151         spin_unlock(&h->pg_lock);
> >1152         if (pg) {
> >1153                 spin_lock_irq(&pg->lock);   <<< from the call trace we can see we were taking this lock when the NMI exception fired, because the other thread hit the BUG_ON
> >1154                 list_del_rcu(&h->node);
> >
> >
> >Meanwhile, alua_rtpg runs into BUG_ON(!h->sdev):
> >
> >alua_rtpg_work -> alua_rtpg
> >----
> > 505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
> > 506 {
> > .
> > .
> > .
> > 659                         list_for_each_entry_rcu(h,
> > 660                                                 &tmp_pg->dh_list, node) {
> > 661                                 /* h->sdev should always be valid */
> > *662*                               BUG_ON(!h->sdev);   <<<< the 2nd call trace panicked due to this BUG_ON
> > 663                                 h->sdev->access_state = desc[0];
> > 664                         }
> > 665                 rcu_read_unlock();
> > 666         }
> >----
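
To make the window concrete, here is a minimal userspace analogue of the racy
pattern, using plain pthreads. Every name is a hypothetical stand-in, not the
driver code; build with "gcc -pthread race.c". One thread plays
alua_bus_detach and NULLs the shared pointer under its lock, while the other
plays the lockless RCU-style reader in alua_rtpg and asserts the pointer is
still valid:

----
/* Userspace sketch of the race: "detach" NULLs h->sdev under its own
 * lock, while the "rtpg" reader walks the list locklessly (as an RCU
 * reader does) and asserts that sdev is still valid. The assert can
 * fire whenever the detach store wins the race. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct dh_data {
	struct dh_data *next;          /* stand-in for the dh_list linkage */
	int *sdev;                     /* stand-in for h->sdev */
};

static struct dh_data *dh_list;        /* one-entry "list" */
static pthread_spinlock_t pg_lock;     /* stand-in for h->pg_lock */

static void *detach_thread(void *arg)  /* plays alua_bus_detach() */
{
	struct dh_data *h = arg;

	pthread_spin_lock(&pg_lock);
	h->sdev = NULL;                /* the store at line 1150 above */
	pthread_spin_unlock(&pg_lock);
	return NULL;                   /* the entry is unlinked only later */
}

static void *rtpg_thread(void *arg)    /* plays the loop in alua_rtpg() */
{
	(void)arg;
	for (struct dh_data *h = dh_list; h; h = h->next)
		assert(h->sdev != NULL);   /* the BUG_ON(!h->sdev) analogue */
	return NULL;
}

int main(void)
{
	static int fake_sdev;
	struct dh_data h = { .next = NULL, .sdev = &fake_sdev };
	pthread_t t1, t2;

	dh_list = &h;
	pthread_spin_init(&pg_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&t1, NULL, detach_thread, &h);
	pthread_create(&t2, NULL, rtpg_thread, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	puts("reader did not trip the assert on this run");
	return 0;
}
----

Whether the assert fires depends entirely on scheduling, which is the same
reason the kernel crash only shows up under fail-over load.
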
> Ah, yes.
> 
> We would need to take 'h->pg_lock' here before checking 'h->sdev'.
> Alternatively, we should be able to fix it by not setting h->sdev to
> NULL, and issuing synchronize_rcu() before issuing kfree(h):
> 
> @@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
>  	spin_lock(&h->pg_lock);
>  	pg = rcu_dereference_protected(h->pg,
>  		lockdep_is_held(&h->pg_lock));
>  	rcu_assign_pointer(h->pg, NULL);
> -	h->sdev = NULL;
>  	spin_unlock(&h->pg_lock);
>  	if (pg) {
>  		spin_lock_irq(&pg->lock);
> @@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
>  		kref_put(&pg->kref, release_port_group);
>  	}
>  	sdev->handler_data = NULL;
> +	synchronize_rcu();
>  	kfree(h);
>  }
> 
> The synchronize_rcu() will ensure that any concurrent thread has
> left the RCU critical section (i.e. the loop mentioned above), and the
> issue will be avoided.
> Additionally, we could replace the BUG_ON() with
> 
> 	if (!h->sdev)
> 		continue;
> 
> and the problem should be solved.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@xxxxxxx                              +49 911 74053 688
> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

> From eb295823f4604a939d89cb21aad468fcd393484b Mon Sep 17 00:00:00 2001
> From: Hannes Reinecke <hare@xxxxxxx>
> Date: Thu, 24 Sep 2020 12:33:22 +0200
> Subject: [PATCH] scsi_dh_alua: avoid crash during alua_bus_detach()
> 
> alua_bus_detach() might be running concurrently with alua_rtpg_work(),
> so we might trip over h->sdev == NULL and call BUG_ON().
> The correct way of handling it would be to not set h->sdev to NULL
> in alua_bus_detach(), and call synchronize_rcu() before the final
> delete to ensure that all concurrent threads have left the critical
> section.
> Then we can get rid of the BUG_ON(), and replace it with a simple
> if condition.
> 
> Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
> ---

Reviewed-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
Tested-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>

>  drivers/scsi/device_handler/scsi_dh_alua.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
> index f32da0ca529e..308bda2e9c00 100644
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> @@ -658,8 +658,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
>  			rcu_read_lock();
>  			list_for_each_entry_rcu(h,
>  						&tmp_pg->dh_list, node) {
> -				/* h->sdev should always be valid */
> -				BUG_ON(!h->sdev);
> +				if (!h->sdev)
> +					continue;
>  				h->sdev->access_state = desc[0];
>  			}
>  			rcu_read_unlock();
> @@ -705,7 +705,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
>  		pg->expiry = 0;
>  		rcu_read_lock();
>  		list_for_each_entry_rcu(h, &pg->dh_list, node) {
> -			BUG_ON(!h->sdev);
> +			if (!h->sdev)
> +				continue;
>  			h->sdev->access_state =
>  				(pg->state & SCSI_ACCESS_STATE_MASK);
>  			if (pg->pref)
> @@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
>  	spin_lock(&h->pg_lock);
>  	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
>  	rcu_assign_pointer(h->pg, NULL);
> -	h->sdev = NULL;
>  	spin_unlock(&h->pg_lock);
>  	if (pg) {
>  		spin_lock_irq(&pg->lock);
> @@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
>  		kref_put(&pg->kref, release_port_group);
>  	}
>  	sdev->handler_data = NULL;
> +	synchronize_rcu();
>  	kfree(h);
>  }
> 
> -- 
> 2.16.4
> 
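
As a side note, the lifetime rule the patch relies on can be sketched in
userspace with liburcu (userspace RCU). This assumes liburcu is installed and
builds with something like "gcc -pthread fix.c -lurcu"; header names and link
flags can vary slightly between liburcu versions, and all other names here
are hypothetical. Readers skip entries whose sdev is gone, and the detach
side frees the element only after synchronize_rcu() guarantees that every
reader has left its critical section:

----
#define _LGPL_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <urcu.h>            /* rcu_read_lock(), synchronize_rcu(), ... */
#include <urcu/rculist.h>    /* cds_list_*_rcu() */

struct dh_data {
	struct cds_list_head node;
	int *sdev;
};

static CDS_LIST_HEAD(dh_list);

static void reader_side(void)            /* plays the loop in alua_rtpg() */
{
	struct dh_data *h;

	rcu_read_lock();
	cds_list_for_each_entry_rcu(h, &dh_list, node) {
		if (!h->sdev)            /* the "if (!h->sdev) continue;" fix */
			continue;
		printf("sdev value: %d\n", *h->sdev);
	}
	rcu_read_unlock();
}

static void detach_side(struct dh_data *h)  /* plays alua_bus_detach() */
{
	cds_list_del_rcu(&h->node);  /* unlink; readers may still see it */
	synchronize_rcu();           /* wait until all readers are done */
	free(h);                     /* now nobody can hold a reference */
}

int main(void)
{
	static int fake_sdev = 42;
	struct dh_data *h = calloc(1, sizeof(*h));

	h->sdev = &fake_sdev;
	rcu_register_thread();       /* every RCU reader thread registers */
	cds_list_add_rcu(&h->node, &dh_list);

	reader_side();
	detach_side(h);

	rcu_unregister_thread();
	return 0;
}
----

The design point is the same as in the patch: the writer never invalidates
what a published reader may still see; it unpublishes first, waits out the
grace period, and only then frees.
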