On Thu, 28 Dec 2023 at 12:09, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2023/12/28 12:19, Alexey Klimov wrote:
> > On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/12/14 7:48, Alexey Klimov wrote:
> >>> Hi all,
> >>>
> >>> After assembling a raid1 consisting of two NVMe disks/partitions, where
> >>> one of the NVMes is slower than the other, using the following command:
> >>> mdadm --homehost=any --create --verbose --level=1 --metadata=1.2
> >>> --raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal
> >>> --write-mostly --write-behind=8192 /dev/nvme1n1p2
> >>>
> >>> I noticed some I/O freezing/lockup issues when doing distro builds
> >>> using yocto. The idea of building a write-mostly raid1 came from URL
> >>> [0]. I suspected that massive and long IO operations led to that, and
> >>> while trying to narrow it down I can see that it doesn't survive
> >>> rebuilding the linux kernel (just a simple make -j33).
> >>>
> >>> After enabling some lock checks and the lockup detectors in the kernel,
> >>> I think this is the main blocked task message:
> >>>
> >>> [ 984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
> >>> [ 984.138682] Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
> >>> [ 984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> [ 984.138702] task:kworker/u65:5 state:D stack:0 pid:288 tgid:288 ppid:2 flags:0x00004000
> >>> [ 984.138728] Workqueue: writeback wb_workfn (flush-9:77)
> >>> [ 984.138760] Call Trace:
> >>> [ 984.138770] <TASK>
> >>> [ 984.138785] __schedule+0x3a5/0x1600
> >>> [ 984.138807] ? schedule+0x99/0x120
> >>> [ 984.138818] ? find_held_lock+0x2b/0x80
> >>> [ 984.138840] schedule+0x48/0x120
> >>> [ 984.138851] ? schedule+0x99/0x120
> >>> [ 984.138861] wait_for_serialization+0xd2/0x110
> >>
> >> This is waiting for issued IO to be done, from:
> >>   raid1_end_write_request
> >>     remove_serial
> >>       raid1_rb_remove
> >>         wake_up
> >
> > Yep, looks like this.
> >
> >> So the first thing that needs clarification is whether there is unfinished
> >> IO on the underlying disks. This is not easy, but perhaps you can try:
> >>
> >> 1) make sure nothing else is using the underlying disks;
> >> 2) reproduce the problem, and then collect debugfs info for the underlying
> >> disks with the following cmd:
> >>
> >> find /sys/kernel/debug/block/sda/ -type f | xargs grep .
> >
> > I collected this and am attaching it to this email.
> > When I collected this debug data I also noticed the following inflight counters:
> > root@tux:/sys/devices/virtual/block/md77# cat inflight
> >        0       65
> > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
> >        0        0
> > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
> >        0        0
> >
> > So I guess on the md or raid1 level there are 65 write requests that
> > didn't finish, but nothing from the underlying physical devices, right?
>
> Actually, IOs stuck in wait_for_serialization() are accounted in the
> inflight counters of md77.

Do you think ftracing the mdXY_raid1 kernel thread with the function
plugin will help? (A rough sketch of what I have in mind is below.)

> However, since there is no IO from the nvme disks, it looks like something
> is wrong with write behind.
>
> >
> > Apart from that: when the lockup/freeze happens I can mount other
> > partitions on the corresponding nvme devices and create files there.
> > The nvme-cli util also doesn't show any issues AFAICS.
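
To expand on the ftrace question above, this is roughly what I had in mind.
The thread name md77_raid1 and the filter pattern are guesses from my setup,
so please tell me if a different set of functions would be more useful:

cd /sys/kernel/tracing
echo 0 > tracing_on
echo function > current_tracer
echo 'raid1_*' > set_ftrace_filter     # optional, or leave the filter empty
pgrep md77_raid1 > set_ftrace_pid      # trace only the raid1 kthread
echo 1 > tracing_on
# ... reproduce the hang ...
echo 0 > tracing_on
cat trace > /tmp/md77_raid1.ftrace

or the trace-cmd equivalent:

trace-cmd record -p function -P "$(pgrep md77_raid1)"

I'm not sure yet whether plain function tracing will show where the writes
get stuck, but it should at least tell whether the raid1 thread is still
making progress.
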
> >
> > When I manually set the backlog file to zero:
> > echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
> > the lockup is no longer reproducible.
>
> Of course, this disables write behind, which also indicates that something
> is wrong with write behind.

Yes, in some sense that is the main topic of my emails. I am just trying
to get you more debugging data.

> I'm trying to review the related code, however, this might be difficult.
> Can you describe in detailed steps how you reproduce this problem, and I
> will try to reproduce it? If I still can't, can you apply a debug patch
> and recompile the kernel to test?

I am trying to reproduce this using files and loop devices and I'll provide
instructions; a rough sketch of what I'm trying so far is in the P.S. below.

Regarding patches: yes, see my first email. I will be happy to test patches
and collect more debug data.

Thanks,
Alexey
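
P.S. In case it helps in the meantime, this is roughly the loop-device setup
I'm experimenting with. The sizes, paths and the write workload are just my
current guesses; I haven't confirmed yet that this combination triggers the
hang:

truncate -s 4G /tmp/fast.img
truncate -s 4G /tmp/slow.img
FAST=$(losetup --find --show /tmp/fast.img)
SLOW=$(losetup --find --show /tmp/slow.img)
# same layout as the original array: the second leg is write-mostly
# with write-behind enabled
mdadm --create /dev/md77 --level=1 --metadata=1.2 --raid-devices=2 \
    --bitmap=internal "$FAST" --write-mostly --write-behind=8192 "$SLOW"
mkfs.ext4 /dev/md77
mount /dev/md77 /mnt
# then generate a lot of buffered write traffic, e.g. a kernel build
# in /mnt, or something like:
fio --name=writebehind --directory=/mnt --rw=randwrite --bs=4k --size=2G \
    --numjobs=8 --time_based --runtime=600

Since both loop devices sit on the same disk here, one leg is not really
slower than the other; if that turns out to matter I'll try putting a
dm-delay target on top of the write-mostly loop device to emulate the
slow NVMe.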