Re: RAID1 write-mostly+write-behind lockup bug, reproduced under 6.7-rc5

Hi,

On 2023/12/28 12:19, Alexey Klimov wrote:
On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

On 2023/12/14 7:48, Alexey Klimov wrote:
Hi all,

After assembling a RAID1 array from two NVMe disks/partitions, where
one of the NVMes is slower than the other, using this command:
mdadm --homehost=any --create --verbose --level=1 --metadata=1.2
--raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal
--write-mostly --write-behind=8192 /dev/nvme1n1p2
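
For reference, the resulting write-mostly/write-behind setup can be
double-checked with something like:

cat /proc/mdstat
mdadm --detail /dev/md77
cat /sys/devices/virtual/block/md77/md/bitmap/backlog

(the slower member should be marked (W) in /proc/mdstat, and backlog
should report the 8192 passed via --write-behind).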

I noticed I/O freezing/lockup issues when doing distro builds with
Yocto. The idea of building a write-mostly RAID1 came from URL [0]. I
suspected that massive, long-running I/O operations were the trigger,
and while trying to narrow it down I found that the array doesn't
survive rebuilding the Linux kernel (just a simple make -j33).

After enabling some lock checking and the lockup detectors in the
kernel, I think this is the main blocked-task message:

[  984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
[  984.138682]       Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
[  984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  984.138702] task:kworker/u65:5   state:D stack:0     pid:288
tgid:288   ppid:2      flags:0x00004000
[  984.138728] Workqueue: writeback wb_workfn (flush-9:77)
[  984.138760] Call Trace:
[  984.138770]  <TASK>
[  984.138785]  __schedule+0x3a5/0x1600
[  984.138807]  ? schedule+0x99/0x120
[  984.138818]  ? find_held_lock+0x2b/0x80
[  984.138840]  schedule+0x48/0x120
[  984.138851]  ? schedule+0x99/0x120
[  984.138861]  wait_for_serialization+0xd2/0x110
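
If it's useful, the stacks of all blocked tasks can also be dumped
(assuming sysrq is enabled) with:

echo w > /proc/sysrq-trigger

to check whether the other stuck tasks are sitting in
wait_for_serialization() as well.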

This is waiting for previously issued IO to be done; the wake-up comes from:
raid1_end_write_request
   remove_serial
    raid1_rb_remove
    wake_up

Yep, looks like this.

So the first thing that needs clarification is whether there is
unfinished IO on the underlying disks. This is not easy, but perhaps
you can try:

1) make sure nothing else is using the underlying disks;
2) reproduce the problem, and then collect debugfs info for the
underlying disks with the following cmd:
find /sys/kernel/debug/block/sda/ -type f | xargs grep .
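
For this setup the underlying disks are nvme rather than sda, so
(assuming the whole-disk debugfs directories nvme1n1 and nvme2n1) the
equivalent would be something like:

find /sys/kernel/debug/block/nvme1n1/ /sys/kernel/debug/block/nvme2n1/ -type f | xargs grep .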

I collected this and am attaching it to this email.
While collecting the debug data I also noticed the following inflight counters:
root@tux:/sys/devices/virtual/block/md77# cat inflight
        0       65
root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
        0        0
root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
        0        0

So I guess that at the md/raid1 level there are 65 write requests
that didn't finish, but nothing is outstanding on the underlying
physical devices, right?

Actually, the IOs stuck in wait_for_serialization() are accounted in
the inflight counters of md77.

However, since there is no IO outstanding on the nvme disks, it looks
like something is wrong with write-behind.


Apart from that: when the lockup/freeze happens I can still mount
other partitions on the corresponding nvme devices and create files
there. The nvme-cli util also doesn't show any issues AFAICS.

When I manually set the backlog file to zero:
echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
the lockup is no longer reproducible.

Of course, this disables write-behind, which also indicates that
something is wrong with write-behind.
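
For further testing with write-behind enabled, restoring the value
from --write-behind should turn it back on, something like:

echo 8192 > /sys/devices/virtual/block/md77/md/bitmap/backlog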

I'm trying to review the related code; however, this might be
difficult. Can you describe, in detailed steps, how you reproduce this
problem? I will try to reproduce it here. If I still can't, could you
apply a debug patch and recompile the kernel to test?

Thanks,
Kuai


Let me know what other debug data I can collect.

Thanks,
Alexey





