Re: RAID1 write-mostly+write-behind lockup bug, reproduced under 6.7-rc5

Hi,

On 2023/12/30 9:27, Alexey Klimov wrote:
On Thu, 28 Dec 2023 at 12:09, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

On 2023/12/28 12:19, Alexey Klimov wrote:
On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

On 2023/12/14 7:48, Alexey Klimov wrote:
Hi all,

After assembling a raid1 array from two NVMe disks/partitions, where
one of the NVMes is slower than the other, using this command:
mdadm --homehost=any --create --verbose --level=1 --metadata=1.2
--raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal
--write-mostly --write-behind=8192 /dev/nvme1n1p2

I noticed I/O freezing/lockup issues when doing distro builds
with yocto. The idea of building a write-mostly raid1 came from URL
[0]. I suspected that massive, long-running IO operations triggered
it, and while trying to narrow it down I found that the array doesn't
survive a linux kernel rebuild (just a simple make -j33).

After enabling some lock checks and lockup detectors in the kernel, I
think this is the main blocked-task message:

[  984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
[  984.138682]       Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
[  984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  984.138702] task:kworker/u65:5   state:D stack:0     pid:288
tgid:288   ppid:2      flags:0x00004000
[  984.138728] Workqueue: writeback wb_workfn (flush-9:77)
[  984.138760] Call Trace:
[  984.138770]  <TASK>
[  984.138785]  __schedule+0x3a5/0x1600
[  984.138807]  ? schedule+0x99/0x120
[  984.138818]  ? find_held_lock+0x2b/0x80
[  984.138840]  schedule+0x48/0x120
[  984.138851]  ? schedule+0x99/0x120
[  984.138861]  wait_for_serialization+0xd2/0x110

This is waiting for issued IO to be done, from
raid1_end_write_request
    remove_serial
     raid1_rb_remove
     wake_up

Yep, looks like this.

So the first thing that needs clarification is whether there is
unfinished IO on the underlying disks. This is not easy, but perhaps
you can try:

1) make sure nothing else uses the underlying disks;
2) reproduce the problem, and then collect debugfs info for the
underlying disks with the following cmd:

find /sys/kernel/debug/block/sda/ -type f | xargs grep .

I collected this and attaching to this email.
When I collected this debug data I also noticed the following inflight counters:
root@tux:/sys/devices/virtual/block/md77# cat inflight
         0       65
root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
         0        0
root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
         0        0

So I guess on the md or raid1 level there are 65 write requests that
didn't finish but nothing from underlying physical devices, right?

Actually, IOs stuck in wait_for_serialization() are counted as
inflight by md77.

Do you think ftracing the mdXY_raid1 kernel thread with the function tracer will help?

However, since there is no IO on the nvme disks, it looks like
something is wrong with write-behind.


Apart from that: when the lockup/freeze happens I can still mount other
partitions on the corresponding nvme devices and create files there.
The nvme-cli util also doesn't show any issues AFAICS.

When I manually set backlog file to zero:
echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
the lockup is no longer reproducible.

Of course, this disables write behind, which also indicates something
is wrong with write behind.

Yes, this is in some sense the main topic of my emails. I am just
trying to get you more debugging data.

I'm trying to review the related code; however, this might be difficult.
Can you describe how you reproduce this problem in detailed steps?
Then I will try to reproduce it. If I still can't, can you apply a
debug patch and recompile the kernel to test?

I am trying to reproduce this using files and loop devices and I'll
provide instructions.
Regarding patches: yes, see my first email. I will be happy to test
patches and collect more debug data.

The attached patch can provide debug info for each r1_bio that is
serialized or waiting for serialization, to judge whether the problem
is related to r1_bio dispatching or to the serialization algorithm.

After the problem is reproduced, just cat /sys/block/md77/md/debug

Thanks,
Kuai

Thanks,
Alexey

From e6fafea7ff62a8092d937b63af4b98706d06e4f0 Mon Sep 17 00:00:00 2001
From: Yu Kuai <yukuai3@xxxxxxxxxx>
Date: Tue, 2 Jan 2024 10:02:46 +0800
Subject: [PATCH] md: debug r1_bio

Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
 drivers/md/md.c    | 15 +++++++++++++
 drivers/md/md.h    |  1 +
 drivers/md/raid1.c | 54 ++++++++++++++++++++++++++++++++++++++--------
 drivers/md/raid1.h |  5 +++++
 4 files changed, 66 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ce32970aecdc..1dcee177aaa8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5599,6 +5599,20 @@ static struct md_sysfs_entry md_serialize_policy =
 __ATTR(serialize_policy, S_IRUGO | S_IWUSR, serialize_policy_show,
        serialize_policy_store);
 
+static ssize_t
+debug_show(struct mddev *mddev, char *page)
+{
+	struct md_personality *pers = mddev->pers;
+
+	if (!pers || !pers->debug)
+		return sprintf(page, "n/a\n");
+
+	pers->debug(mddev);
+	return sprintf(page, "see debug info in kernel log\n");
+}
+
+static struct md_sysfs_entry md_debug =
+__ATTR(debug, 0444, debug_show, NULL);
 
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
@@ -5619,6 +5633,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_consistency_policy.attr,
 	&md_fail_last_dev.attr,
 	&md_serialize_policy.attr,
+	&md_debug.attr,
 	NULL,
 };
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 8d881cc59799..40fd4e324cdb 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -632,6 +632,7 @@ struct md_personality
 	 * array.
 	 */
 	void *(*takeover) (struct mddev *mddev);
+	void (*debug) (struct mddev *mddev);
 	/* Changes the consistency policy of an active array. */
 	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
 };
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index aaa434f0c175..eb763f41dcbb 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -65,17 +65,23 @@ static int check_and_add_serial(struct md_rdev *rdev, struct r1bio *r1_bio,
 	sector_t lo = r1_bio->sector;
 	sector_t hi = lo + r1_bio->sectors;
 	struct serial_in_rdev *serial = &rdev->serial[idx];
+	struct r1conf *conf = rdev->mddev->private;
 
-	spin_lock_irqsave(&serial->serial_lock, flags);
+	spin_lock_irqsave(&conf->debug_lock, flags);
+	spin_lock(&serial->serial_lock);
+	list_del_init(&r1_bio->debug_list);
 	/* collision happened */
-	if (raid1_rb_iter_first(&serial->serial_rb, lo, hi))
+	if (raid1_rb_iter_first(&serial->serial_rb, lo, hi)) {
 		ret = -EBUSY;
-	else {
+		list_add(&r1_bio->debug_list, &conf->busy_list);
+	} else {
 		si->start = lo;
 		si->last = hi;
 		raid1_rb_insert(si, &serial->serial_rb);
+		list_add(&r1_bio->debug_list, &conf->wait_list);
 	}
-	spin_unlock_irqrestore(&serial->serial_lock, flags);
+	spin_unlock(&serial->serial_lock);
+	spin_unlock_irqrestore(&conf->debug_lock, flags);
 
 	return ret;
 }
@@ -94,7 +100,8 @@ static void wait_for_serialization(struct md_rdev *rdev, struct r1bio *r1_bio)
 		   check_and_add_serial(rdev, r1_bio, si, idx) == 0);
 }
 
-static void remove_serial(struct md_rdev *rdev, sector_t lo, sector_t hi)
+static void remove_serial(struct md_rdev *rdev, sector_t lo, sector_t hi,
+			  struct r1bio *r1_bio)
 {
 	struct serial_info *si;
 	unsigned long flags;
@@ -102,12 +109,15 @@ static void remove_serial(struct md_rdev *rdev, sector_t lo, sector_t hi)
 	struct mddev *mddev = rdev->mddev;
 	int idx = sector_to_idx(lo);
 	struct serial_in_rdev *serial = &rdev->serial[idx];
+	struct r1conf *conf = rdev->mddev->private;
 
-	spin_lock_irqsave(&serial->serial_lock, flags);
+	spin_lock_irqsave(&conf->debug_lock, flags);
+	spin_lock(&serial->serial_lock);
 	for (si = raid1_rb_iter_first(&serial->serial_rb, lo, hi);
 	     si; si = raid1_rb_iter_next(si, lo, hi)) {
 		if (si->start == lo && si->last == hi) {
 			raid1_rb_remove(si, &serial->serial_rb);
+			list_del_init(&r1_bio->debug_list);
 			mempool_free(si, mddev->serial_info_pool);
 			found = 1;
 			break;
@@ -115,7 +125,8 @@ static void remove_serial(struct md_rdev *rdev, sector_t lo, sector_t hi)
 	}
 	if (!found)
 		WARN(1, "The write IO is not recorded for serialization\n");
-	spin_unlock_irqrestore(&serial->serial_lock, flags);
+	spin_unlock(&serial->serial_lock);
+	spin_unlock_irqrestore(&conf->debug_lock, flags);
 	wake_up(&serial->serial_io_wait);
 }
 
@@ -314,6 +325,7 @@ static void raid_end_bio_io(struct r1bio *r1_bio)
 	struct r1conf *conf = r1_bio->mddev->private;
 	sector_t sector = r1_bio->sector;
 
+	BUG_ON(!list_empty(&r1_bio->debug_list));
 	/* if nobody has done the final endio yet, do it now */
 	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
 		pr_debug("raid1: sync end %s on sectors %llu-%llu\n",
@@ -525,7 +537,7 @@ static void raid1_end_write_request(struct bio *bio)
 
 	if (behind) {
 		if (test_bit(CollisionCheck, &rdev->flags))
-			remove_serial(rdev, lo, hi);
+			remove_serial(rdev, lo, hi, r1_bio);
 		if (test_bit(WriteMostly, &rdev->flags))
 			atomic_dec(&r1_bio->behind_remaining);
 
@@ -549,7 +561,7 @@ static void raid1_end_write_request(struct bio *bio)
 			}
 		}
 	} else if (rdev->mddev->serialize_policy)
-		remove_serial(rdev, lo, hi);
+		remove_serial(rdev, lo, hi, r1_bio);
 	if (r1_bio->bios[mirror] == NULL)
 		rdev_dec_pending(rdev, conf->mddev);
 
@@ -1206,6 +1218,7 @@ alloc_r1bio(struct mddev *mddev, struct bio *bio)
 	/* Ensure no bio records IO_BLOCKED */
 	memset(r1_bio->bios, 0, conf->raid_disks * sizeof(r1_bio->bios[0]));
 	init_r1bio(r1_bio, mddev, bio);
+	INIT_LIST_HEAD(&r1_bio->debug_list);
 	return r1_bio;
 }
 
@@ -3006,6 +3019,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 
 	err = -EINVAL;
 	spin_lock_init(&conf->device_lock);
+	spin_lock_init(&conf->debug_lock);
 	rdev_for_each(rdev, mddev) {
 		int disk_idx = rdev->raid_disk;
 		if (disk_idx >= mddev->raid_disks
@@ -3025,6 +3039,8 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->raid_disks = mddev->raid_disks;
 	conf->mddev = mddev;
 	INIT_LIST_HEAD(&conf->retry_list);
+	INIT_LIST_HEAD(&conf->busy_list);
+	INIT_LIST_HEAD(&conf->wait_list);
 	INIT_LIST_HEAD(&conf->bio_end_io_list);
 
 	spin_lock_init(&conf->resync_lock);
@@ -3352,6 +3368,7 @@ static void *raid1_takeover(struct mddev *mddev)
 		mddev->new_chunk_sectors = 0;
 		conf = setup_conf(mddev);
 		if (!IS_ERR(conf)) {
+
 			/* Array must appear to be quiesced */
 			conf->array_frozen = 1;
 			mddev_clear_unsupported_flags(mddev,
@@ -3362,6 +3379,24 @@ static void *raid1_takeover(struct mddev *mddev)
 	return ERR_PTR(-EINVAL);
 }
 
+static void raid1_debug(struct mddev *mddev)
+{
+	struct r1conf *conf = mddev->private;
+	unsigned long flags;
+	struct r1bio *r1_bio;
+
+	spin_lock_irqsave(&conf->debug_lock, flags);
+	list_for_each_entry(r1_bio, &conf->busy_list, debug_list) {
+		printk("%s: r1_bio %px busy (%llu + %u)\n", __func__, r1_bio,
+		       r1_bio->sector, r1_bio->sectors);
+	}
+	list_for_each_entry(r1_bio, &conf->wait_list, debug_list) {
+		printk("%s: r1_bio %px wait (%llu + %u)\n", __func__, r1_bio,
+		       r1_bio->sector, r1_bio->sectors);
+	}
+	spin_unlock_irqrestore(&conf->debug_lock, flags);
+}
+
 static struct md_personality raid1_personality =
 {
 	.name		= "raid1",
@@ -3381,6 +3416,7 @@ static struct md_personality raid1_personality =
 	.check_reshape	= raid1_reshape,
 	.quiesce	= raid1_quiesce,
 	.takeover	= raid1_takeover,
+	.debug		= raid1_debug,
 };
 
 static int __init raid_init(void)
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 14d4211a123a..8163a3fb6e48 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -79,6 +79,10 @@ struct r1conf {
 	 * block, or anything else.
 	 */
 	struct list_head	retry_list;
+
+	spinlock_t		debug_lock;
+	struct list_head	busy_list;
+	struct list_head	wait_list;
 	/* A separate list of r1bio which just need raid_end_bio_io called.
 	 * This mustn't happen for writes which had any errors if the superblock
 	 * needs to be written.
@@ -168,6 +172,7 @@ struct r1bio {
 	int			read_disk;
 
 	struct list_head	retry_list;
+	struct list_head	debug_list;
 
 	/*
 	 * When R1BIO_BehindIO is set, we store pages for write behind
-- 
2.39.2

