Re: [PATCH] scsi: alua: fix the race between alua_bus_detach and alua_rtpg

On 9/23/20 7:50 PM, jitendra.khasdev@xxxxxxxxxx wrote:


On 9/23/20 1:47 PM, Hannes Reinecke wrote:
On 9/18/20 5:49 AM, jitendra.khasdev@xxxxxxxxxx wrote:


On 9/17/20 11:00 PM, Ewan D. Milne wrote:
On Tue, 2020-09-15 at 16:28 +0530, Jitendra Khasdev wrote:
This is a patch to fix the race that occurs between bus detach and alua_rtpg.

It flushes all pending work on the workqueue in the bus detach handler, so it can avoid
the race between alua_bus_detach and alua_rtpg.

Here is the call trace where the race was detected.

multipathd call stack:
[exception RIP: native_queued_spin_lock_slowpath+100]
--- <NMI exception stack> ---
native_queued_spin_lock_slowpath at ffffffff89307f54
queued_spin_lock_slowpath at ffffffff89307c18
_raw_spin_lock_irq at ffffffff89bd797b
alua_bus_detach at ffffffff8984dcc8
scsi_dh_release_device at ffffffff8984b6f2
scsi_device_dev_release_usercontext at ffffffff89846edf
execute_in_process_context at ffffffff892c3e60
scsi_device_dev_release at ffffffff8984637c
device_release at ffffffff89800fbc
kobject_cleanup at ffffffff89bb1196
kobject_put at ffffffff89bb12ea
put_device at ffffffff89801283
scsi_device_put at ffffffff89838d5b
scsi_disk_put at ffffffffc051f650 [sd_mod]
sd_release at ffffffffc051f8a2 [sd_mod]
__blkdev_put at ffffffff8952c79e
blkdev_put at ffffffff8952c80c
blkdev_close at ffffffff8952c8b5
__fput at ffffffff894e55e6
____fput at ffffffff894e57ee
task_work_run at ffffffff892c94dc
exit_to_usermode_loop at ffffffff89204b12
do_syscall_64 at ffffffff892044da
entry_SYSCALL_64_after_hwframe at ffffffff89c001b8

kworker:
[exception RIP: alua_rtpg+2003]
account_entity_dequeue at ffffffff892e42c1
alua_rtpg_work at ffffffff8984f097
process_one_work at ffffffff892c4c29
worker_thread at ffffffff892c5a4f
kthread at ffffffff892cb135
ret_from_fork at ffffffff89c00354

Signed-off-by: Jitendra Khasdev <jitendra.khasdev@xxxxxxxxxx>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index f32da0c..024a752 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -1144,6 +1144,9 @@ static void alua_bus_detach(struct scsi_device *sdev)
 	struct alua_dh_data *h = sdev->handler_data;
 	struct alua_port_group *pg;
 
+	sdev_printk(KERN_INFO, sdev, "%s: flushing workqueues\n", ALUA_DH_NAME);
+	flush_workqueue(kaluad_wq);
+
 	spin_lock(&h->pg_lock);
 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
 	rcu_assign_pointer(h->pg, NULL);

I'm not sure this is the best solution.  The current code
references h->sdev when the dh_list is traversed.  So it needs
to remain valid.  Fixing it by flushing the workqueue to avoid
the list traversal code running leaves open the possibility that
future code alterations may expose this problem again.

-Ewan



I see your point, but we are in the detach handler and this code path
only executes when the device is being detached. So, before detaching, flushing
the work-queue will take care of any current code that references h->sdev while
dh_list is being traversed.

Flushing the workqueue is a bit of overkill, seeing that we know exactly which workqueue element we're waiting for.
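
A minimal sketch of that targeted alternative (assuming the pending element
is the port group's rtpg_work, which this driver queues on kaluad_wq via
queue_delayed_work(); a sketch only, not a tested patch):

	/* In alua_bus_detach(), once pg has been fetched from h->pg:
	 * wait only for this port group's queued RTPG work instead
	 * of draining the whole kaluad_wq.
	 */
	if (pg)
		flush_delayed_work(&pg->rtpg_work);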

IMO, I do not think it would create any problem for future code
alterations. Or maybe I am missing something here; what could
be a possible scenario for that?

The problem is more that I'd like to understand where exactly the race condition is. Can you figure out which spinlock is triggering in your stack trace?

Cheers,

Hannes

Hannes,

The race is between "alua_bus_detach" and "alua_rtpg_work".

Whenever we perform a fail-over or turn off the switch, the path goes down, which eventually triggers
blkdev_put -> .. -> scsi_device_dev_release -> .. -> alua_bus_detach, while another thread running alua_rtpg_work executes in parallel. Both threads use the same sdev.

In alua_bus_detach, we set sdev to NULL. From the call trace above (multipathd) we can see that alua_bus_detach ran first and set sdev to NULL. It then continued executing without any problem of its own.

1138 /*
1139  * alua_bus_detach - Detach device handler
1140  * @sdev: device to be detached from
1141  */
1142 static void alua_bus_detach(struct scsi_device *sdev)
1143 {
1144         struct alua_dh_data *h = sdev->handler_data;
1145         struct alua_port_group *pg;
1146
1147         spin_lock(&h->pg_lock);
1148         pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
1149         rcu_assign_pointer(h->pg, NULL);
*1150*         h->sdev = NULL;  << the detach handler won the race and set sdev to NULL
1151         spin_unlock(&h->pg_lock);
1152         if (pg) {
1153                 spin_lock_irq(&pg->lock); <<< from the call trace we can see that we were acquiring this lock when the NMI exception hit, because a different thread had encountered the BUG_ON
1154                 list_del_rcu(&h->node);


Meanwhile alua_rtpg hits the BUG_ON(!h->sdev) check:

alua_rtpg_work -> alua_rtpg
----
  505 static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
  506 {
  .
  .
  .
  659                                         list_for_each_entry_rcu(h,
  660                                                 &tmp_pg->dh_list, node) {
  661                                                 /* h->sdev should always be valid */
  *662*                                                 BUG_ON(!h->sdev); <<<< the second call trace (kworker) panicked here due to this BUG_ON
  663                                                 h->sdev->access_state = desc[0];
  664                                         }
  665                                         rcu_read_unlock();
  666                                 }
----
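
Putting the two traces together, the interleaving looks roughly like this
(a sketch of the ordering implied by the traces above, not verbatim kernel
output):

multipathd (alua_bus_detach)        kworker (alua_rtpg_work -> alua_rtpg)
----------------------------        -------------------------------------
spin_lock(&h->pg_lock);
rcu_assign_pointer(h->pg, NULL);
h->sdev = NULL;
spin_unlock(&h->pg_lock);
                                    rcu_read_lock();
                                    list_for_each_entry_rcu(h, ...)
                                            BUG_ON(!h->sdev);  <-- panic
spin_lock_irq(&pg->lock);
^-- still spinning here when the panic NMI hits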

Ah, yes.

We would need to take 'h->pg_lock' here before checking 'h->sdev'.
Alternatively, we should be able to fix it by not setting h->sdev to NULL, and issuing synchronize_rcu() before the final kfree(h):

@@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
 	spin_lock(&h->pg_lock);
 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
 	rcu_assign_pointer(h->pg, NULL);
-	h->sdev = NULL;
 	spin_unlock(&h->pg_lock);
 	if (pg) {
 		spin_lock_irq(&pg->lock);
@@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
 		kref_put(&pg->kref, release_port_group);
 	}
 	sdev->handler_data = NULL;
+	synchronize_rcu();
 	kfree(h);
 }

The synchronize_rcu() call will ensure that any concurrent thread has left the RCU critical section (ie the loop mentioned above), and the issue will be avoided.
Additionally, we could replace the BUG_ON() with

if (!h->sdev)
    continue;

and the problem should be solved.
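
For reference, this is the standard RCU unlink-wait-free pattern; a minimal
self-contained sketch with illustrative names (struct item and payload stand
in for the driver's alua_dh_data and h->sdev):

	#include <linux/list.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct item {
		struct list_head node;
		void *payload;			/* stands in for h->sdev */
	};

	/* Reader: may still see an entry until a grace period elapses,
	 * so it must tolerate logically dead entries rather than BUG_ON().
	 */
	static void reader(struct list_head *head)
	{
		struct item *it;

		rcu_read_lock();
		list_for_each_entry_rcu(it, head, node) {
			if (!it->payload)	/* detached entry, skip it */
				continue;
			/* ... use it->payload ... */
		}
		rcu_read_unlock();
	}

	/* Updater: unlink, wait for all pre-existing readers to leave
	 * their read-side critical sections, then free.
	 */
	static void updater(struct item *it)
	{
		list_del_rcu(&it->node);
		synchronize_rcu();	/* no reader can still hold 'it' */
		kfree(it);
	}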

Cheers,

Hannes
--
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@xxxxxxx			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
From eb295823f4604a939d89cb21aad468fcd393484b Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@xxxxxxx>
Date: Thu, 24 Sep 2020 12:33:22 +0200
Subject: [PATCH] scsi_dh_alua: avoid crash during alua_bus_detach()

alua_bus_detach() might be running concurrently with alua_rtpg_work(),
so we might trip over h->sdev == NULL and hit the BUG_ON().
The correct way of handling this is to not set h->sdev to NULL
in alua_bus_detach(), and to call synchronize_rcu() before the final
kfree() to ensure that all concurrent threads have left the critical
section.
Then we can get rid of the BUG_ON() and replace it with a simple
if condition.

Signed-off-by: Hannes Reinecke <hare@xxxxxxx>
---
 drivers/scsi/device_handler/scsi_dh_alua.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index f32da0ca529e..308bda2e9c00 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -658,8 +658,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
 					rcu_read_lock();
 					list_for_each_entry_rcu(h,
 						&tmp_pg->dh_list, node) {
-						/* h->sdev should always be valid */
-						BUG_ON(!h->sdev);
+						if (!h->sdev)
+							continue;
 						h->sdev->access_state = desc[0];
 					}
 					rcu_read_unlock();
@@ -705,7 +705,8 @@ static int alua_rtpg(struct scsi_device *sdev, struct alua_port_group *pg)
 			pg->expiry = 0;
 			rcu_read_lock();
 			list_for_each_entry_rcu(h, &pg->dh_list, node) {
-				BUG_ON(!h->sdev);
+				if (!h->sdev)
+					continue;
 				h->sdev->access_state =
 					(pg->state & SCSI_ACCESS_STATE_MASK);
 				if (pg->pref)
@@ -1147,7 +1148,6 @@ static void alua_bus_detach(struct scsi_device *sdev)
 	spin_lock(&h->pg_lock);
 	pg = rcu_dereference_protected(h->pg, lockdep_is_held(&h->pg_lock));
 	rcu_assign_pointer(h->pg, NULL);
-	h->sdev = NULL;
 	spin_unlock(&h->pg_lock);
 	if (pg) {
 		spin_lock_irq(&pg->lock);
@@ -1156,6 +1156,7 @@ static void alua_bus_detach(struct scsi_device *sdev)
 		kref_put(&pg->kref, release_port_group);
 	}
 	sdev->handler_data = NULL;
+	synchronize_rcu();
 	kfree(h);
 }
 
-- 
2.16.4

