Re: [PATCH -next 1/6] Revert "md: unlock mddev before reap sync_thread in action_store"

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Wed, 22 Mar 2023 17:00:28 +0800

Hi,

在 2023/03/22 15:19, Guoqing Jiang 写道:


On 3/22/23 14:41, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

This reverts commit 9dfbdafda3b34e262e43e786077bab8e476a89d1.

Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected 
problems,
for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but 
was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
  __list_add_valid+0xfc/0x140
  insert_work+0x78/0x1a0
  __queue_work+0x500/0xcf4
  queue_work_on+0xe8/0x12c
  md_check_recovery+0xa34/0xf30
  raid10d+0xb8/0x900 [raid10]
  md_thread+0x16c/0x2cc
  kthread+0x1a4/0x1ec
  ret_from_fork+0x10/0x18

This is because work is requeued while it's still inside workqueue:

If the workqueue subsystem can have such problem because of md flag,
then I have to think workqueue is fragile.

t1:            t2:
action_store
  mddev_lock
   if (mddev->sync_thread)
    mddev_unlock
    md_unregister_thread
    // first sync_thread is done
            md_check_recovery
             mddev_try_lock
             /*
              * once MD_RECOVERY_DONE is set, new sync_thread
              * can start.
              */
             set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
             INIT_WORK(&mddev->del_work, md_start_sync)
             queue_work(md_misc_wq, &mddev->del_work)
              test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)

Assume you mean below,

1551 if(!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
1552                 __queue_work(cpu, wq, work);
1553                 ret = true;
1554         }

Could you explain how the same work can be re-queued? Isn't the PENDING_BIT
is already set in t3? I believe queue_work shouldn't do that per the 
comment
but I am not expert ...

This is not related to workqueue, it is just because raid10
reinitialize the work that is already queued, like I discribed later
in t3:

t2:
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail
...

t3: -> work is still pending
md_check_recovery:
 INIT_WORK -> clear pending
 queue_work -> set pending
  list_add_tail -> list is corrupted


Returns %false if @work was already on a queue, %true otherwise.

              // set pending bit
              insert_work
               list_add_tail
             mddev_unlock
    mddev_lock_nointr
    md_reap_sync_thread
    // MD_RECOVERY_RUNNING is cleared
  mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
  // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
  INIT_WORK(&mddev->del_work, md_start_sync)
   work->data = 0
   // work pending bit is cleared
  queue_work(md_misc_wq, &mddev->del_work)
   insert_work
    list_add_tail
    // list is corrupted

This patch revert the commit to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.

Pls cc the previous users who had encounter the problem to test the
second patch.

Ok, cc Marc. Can you try if this patchset fix the problem you reproted
in the following thread?

md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition

And can you share your test which can trigger the re-queued issue?
I'd like to try with latest mainline such as 6.3-rc3, and your test is
not only run against 5.10 kernel as you described before, right?


Of course, our 5.10 and mainline are the same,

there are some tests:

First the deadlock can be reporduced reliably, test script is simple:

mdadm -Cv /dev/md0 -n 4 -l10 /dev/sd[abcd]

fio -filename=/dev/md0 -rw=randwrite -direct=1 -name=a -bs=4k 
-numjobs=16 -iodepth=16 &

echo -1 > /sys/kernel/debug/fail_make_request/times
echo 1 > /sys/kernel/debug/fail_make_request/probability
echo 1 > /sys/block/sda/make-it-fail

{
        while true; do
                mdadm -f /dev/md0 /dev/sda
                mdadm -r /dev/md0 /dev/sda
                mdadm --zero-superblock /dev/sda
                mdadm -a /dev/md0 /dev/sda
                sleep 2
        done
} &

{
        while true; do
                mdadm -f /dev/md0 /dev/sdd
                mdadm -r /dev/md0 /dev/sdd
                mdadm --zero-superblock /dev/sdd
                mdadm -a /dev/md0 /dev/sdd
                sleep 10
        done
} &

{
        while true; do
                echo frozen > /sys/block/md0/md/sync_action
                echo idle > /sys/block/md0/md/sync_action
                sleep 0.1
        done
} &

Then, the problem MD_RECOVERY_RUNNING can be cleared can't be reporduced
reliably, usually it takes 2+ days to triggered a problem, and each time
problem phenomenon can be different, I'm hacking the kernel and add
some BUG_ON to test MD_RECOVERY_RUNNING in attached patch, following
test can trigger the BUG_ON:

mdadm -Cv /dev/md0 -e1.0 -n 4 -l 10 /dev/sd{a..d} --run
sleep 5
echo 1 > /sys/module/md_mod/parameters/set_delay
echo idle > /sys/block/md0/md/sync_action &
sleep 5
echo "want_replacement" > /sys/block/md0/md/dev-sdd/state

test result:

[  228.390237] md_check_recovery: running is set
[  228.391376] md_check_recovery: queue new sync thread
[  233.671041] action_store unregister success! delay 10s
[  233.689276] md_check_recovery: running is set
[  238.722448] md_check_recovery: running is set
[  238.723328] md_check_recovery: queue new sync thread
[  238.724851] md_do_sync: before new wor, sleep 10s
[  239.725818] md_do_sync: delay done
[  243.674828] action_store delay done
[  243.700102] md_reap_sync_thread: running is cleared!
[  243.748703] ------------[ cut here ]------------
[  243.749656] kernel BUG at drivers/md/md.c:9084!
[  243.750548] invalid opcode: 0000 [#1] PREEMPT SMP
[  243.752028] CPU: 6 PID: 1495 Comm: md0_resync Not tainted 
6.3.0-rc1-next-20230310-00001-g4b3965bcb967-dirty #47
[  243.755030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 
04/01/2014
[  243.758516] RIP: 0010:md_do_sync+0x16a9/0x1b00
[  243.759583] Code: ff 48 83 05 60 ce a7 0c 01 e9 8d f9 ff ff 48 83 05 
13 ce a7 0c 01 48 c7 c6 e9 e0 29 83 e9 3b f9 ff ff 48 83 05 5f d0 a7 0c 
01 <0f> 0b 48 83 05 5d d0 a7 0c 01 e8 f8 d5 0b0
[  243.763661] RSP: 0018:ffffc90003847d50 EFLAGS: 00010202
[  243.764212] RAX: 0000000000000028 RBX: ffff88817b529000 RCX: 
0000000000000000
[  243.764936] RDX: 0000000000000000 RSI: 0000000000000206 RDI: 
ffff888100040740
[  243.765648] RBP: 00000000002d6780 R08: 0101010101010101 R09: 
ffff888165671d80
[  243.766352] R10: ffffffff8ad6096c R11: ffff88816fcfa9f0 R12: 
0000000000000001
[  243.767066] R13: ffff888173920040 R14: ffff88817b529000 R15: 
0000000000187100
[  243.767781] FS:  0000000000000000(0000) GS:ffff888ffef80000(0000) 
knlGS:0000000000000000
[  243.768588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  243.769172] CR2: 00005599effa8451 CR3: 00000001663e6000 CR4: 
00000000000006e0
[  243.769888] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  243.770598] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  243.771300] Call Trace:
[  243.771555]  <TASK>
[  243.771779]  ? kvm_clock_read+0x14/0x30
[  243.772169]  ? kvm_sched_clock_read+0x9/0x20
[  243.772611]  ? sched_clock_cpu+0x21/0x330
[  243.773023]  md_thread+0x2ec/0x300
[  243.773373]  ? md_write_start+0x420/0x420
[  243.773845]  kthread+0x13e/0x1a0
[  243.774210]  ? kthread_exit+0x50/0x50
[  243.774591]  ret_from_fork+0x1f/0x30

Thanks,
Guoqing

.

From 0f82a9298db4b3711863022dc0805c908db3bb98 Mon Sep 17 00:00:00 2001
From: Li Nan <linan122@xxxxxxxxxx>
Date: Mon, 20 Mar 2023 16:53:08 +0800
Subject: [PATCH] echo idle

Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
 drivers/md/md.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 546b1b81eb28..196810067f29 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4753,6 +4753,7 @@ action_show(struct mddev *mddev, char *page)
 	return sprintf(page, "%s\n", type);
 }
 
+static bool set_delay = false;
 static ssize_t
 action_store(struct mddev *mddev, const char *page, size_t len)
 {
@@ -4775,6 +4776,11 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 				mddev_unlock(mddev);
 				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 				md_unregister_thread(&mddev->sync_thread);
+				if (set_delay) {
+					printk("%s unregister success! delay 10s\n", __func__);
+					mdelay(10000);
+					printk("%s delay done\n", __func__);
+				}
 				mddev_lock_nointr(mddev);
 				/*
 				 * set RECOVERY_INTR again and restore reshape
@@ -8700,6 +8706,13 @@ void md_do_sync(struct md_thread *thread)
 	struct blk_plug plug;
 	int ret;
 
+	if (set_delay) {
+		printk("%s: before new wor, sleep 10s\n", __func__);
+		mdelay(1000);
+		printk("%s: delay done\n", __func__);
+	}
+	BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+
 	/* just incase thread restarts... */
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 	    test_bit(MD_RECOVERY_WAIT, &mddev->recovery))
@@ -8899,6 +8912,7 @@ void md_do_sync(struct md_thread *thread)
 
 		skipped = 0;
 
+		BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
 		if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
 		    ((mddev->curr_resync > mddev->curr_resync_completed &&
 		      (mddev->curr_resync - mddev->curr_resync_completed)
@@ -9067,6 +9081,7 @@ void md_do_sync(struct md_thread *thread)
 	/* set CHANGE_PENDING here since maybe another update is needed,
 	 * so other nodes are informed. It should be harmless for normal
 	 * raid */
+	BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_PENDING) | BIT(MD_SB_CHANGE_DEVS));
 
@@ -9361,6 +9376,7 @@ void md_check_recovery(struct mddev *mddev)
 		mddev->curr_resync_completed = 0;
 		spin_lock(&mddev->lock);
 		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+		printk("%s: running is set\n", __func__);
 		spin_unlock(&mddev->lock);
 		/* Clear some bits that don't mean anything, but
 		 * might be left set
@@ -9405,6 +9421,9 @@ void md_check_recovery(struct mddev *mddev)
 				 */
 				md_bitmap_write_all(mddev->bitmap);
 			}
+
+			printk("%s: queue new sync thread\n", __func__);
+			BUG_ON(!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
 			INIT_WORK(&mddev->del_work, md_start_sync);
 			queue_work(md_misc_wq, &mddev->del_work);
 			goto unlock;
@@ -9463,6 +9482,7 @@ void md_reap_sync_thread(struct mddev *mddev)
 	if (test_and_clear_bit(MD_CLUSTER_RESYNC_LOCKED, &mddev->flags))
 		md_cluster_ops->resync_finish(mddev);
 	clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+	printk("%s: running is cleared!\n", __func__);
 	clear_bit(MD_RECOVERY_DONE, &mddev->recovery);
 	clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 	clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
@@ -9956,6 +9976,7 @@ module_param_call(start_ro, set_ro, get_ro, NULL, S_IRUSR|S_IWUSR);
 module_param(start_dirty_degraded, int, S_IRUGO|S_IWUSR);
 module_param_call(new_array, add_named_array, NULL, NULL, S_IWUSR);
 module_param(create_on_open, bool, S_IRUSR|S_IWUSR);
+module_param(set_delay, bool, S_IRUSR|S_IWUSR);
 
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("MD RAID framework");
-- 
2.31.1