On Tue, 2024-01-02 at 10:54 +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/12/30 9:27, Alexey Klimov wrote:
> > On Thu, 28 Dec 2023 at 12:09, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On 2023/12/28 12:19, Alexey Klimov wrote:
> > > > On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On 2023/12/14 7:48, Alexey Klimov wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > After assembling a raid1 consisting of two NVMe disks/partitions,
> > > > > > where one of the NVMes is slower than the other, using this command:
> > > > > >
> > > > > > mdadm --homehost=any --create --verbose --level=1 --metadata=1.2 \
> > > > > >     --raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal \
> > > > > >     --write-mostly --write-behind=8192 /dev/nvme1n1p2
> > > > > >
> > > > > > I noticed some I/O freezing/lockup issues when doing distro builds
> > > > > > using yocto. The idea of building a write-mostly raid1 came from URL
> > > > > > [0]. I suspected that massive and long I/O operations led to this,
> > > > > > and while trying to narrow it down I can see that it doesn't survive
> > > > > > rebuilding the Linux kernel (just a simple make -j33).
> > > > > >
> > > > > > After enabling some lock checks and lockup detectors in the kernel,
> > > > > > I think this is the main blocked-task message:
> > > > > >
> > > > > > [  984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
> > > > > > [  984.138682]       Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
> > > > > > [  984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > [  984.138702] task:kworker/u65:5   state:D stack:0   pid:288   tgid:288   ppid:2   flags:0x00004000
> > > > > > [  984.138728] Workqueue: writeback wb_workfn (flush-9:77)
> > > > > > [  984.138760] Call Trace:
> > > > > > [  984.138770]  <TASK>
> > > > > > [  984.138785]  __schedule+0x3a5/0x1600
> > > > > > [  984.138807]  ? schedule+0x99/0x120
> > > > > > [  984.138818]  ? find_held_lock+0x2b/0x80
> > > > > > [  984.138840]  schedule+0x48/0x120
> > > > > > [  984.138851]  ? schedule+0x99/0x120
> > > > > > [  984.138861]  wait_for_serialization+0xd2/0x110
> > > > >
> > > > > This is waiting for issued IO to be done, from
> > > > > raid1_end_write_request
> > > > >   remove_serial
> > > > >     raid1_rb_remove
> > > > >     wake_up
> > > >
> > > > Yep, looks like this.
> > > >
> > > > > So the first thing that needs clarification is whether there is
> > > > > unfinished IO from the underlying disks. This is not easy, but
> > > > > perhaps you can try:
> > > > >
> > > > > 1) don't let anyone else use the underlying disks;
> > > > > 2) reproduce the problem, and then collect debugfs info for the
> > > > > underlying disks with the following cmd:
> > > > >
> > > > > find /sys/kernel/debug/block/sda/ -type f | xargs grep .
> > > >
> > > > I collected this and am attaching it to this email.
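
(For reference: to collect the same data on another setup, something
along these lines should work. This is only a sketch -- the device and
array names are from my setup, and the output file names are arbitrary:

    # dump blk-mq debugfs state for the two member disks (needs root)
    for d in nvme1n1 nvme2n1; do
        find /sys/kernel/debug/block/$d/ -type f | xargs grep . > blk-debugfs-$d.txt
    done
    # snapshot in-flight counters for the array and its members
    grep . /sys/block/md77/inflight /sys/block/md77/slaves/*/inflight
)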
> > > > When I collected this debug data I also noticed the following
> > > > inflight counters:
> > > >
> > > > root@tux:/sys/devices/virtual/block/md77# cat inflight
> > > >        0       65
> > > > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
> > > >        0        0
> > > > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
> > > >        0        0
> > > >
> > > > So I guess at the md/raid1 level there are 65 write requests that
> > > > didn't finish, but nothing from the underlying physical devices,
> > > > right?
> > >
> > > Actually, IOs stuck in wait_for_serialization() are accounted by
> > > inflight from md77.
> >
> > Do you think ftracing the mdXY_raid1 kernel thread with the function
> > plugin will help?
> >
> > > However, since there is no IO from the nvme disks, it looks like
> > > something is wrong with write-behind.
> > >
> > > > Apart from that: when the lockup/freeze happens I can mount other
> > > > partitions on the corresponding nvme devices and create files
> > > > there. The nvme-cli util also doesn't show any issues, AFAICS.
> > > >
> > > > When I manually set the backlog file to zero:
> > > >
> > > > echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
> > > >
> > > > the lockup is no longer reproducible.
> > >
> > > Of course, this disables write-behind, which also indicates that
> > > something is wrong with write-behind.
> >
> > Yes, in some sense this is the main topic of my emails. I am just
> > trying to get you more debugging data.
> >
> > > I'm trying to review the related code; however, this might be
> > > difficult. Can you describe how you reproduce this problem in
> > > detailed steps, and I will try to reproduce it? If I still can't,
> > > can you apply a debug patch and recompile the kernel to test?
> >
> > I am trying to reproduce this using files and loop devices and I'll
> > provide instructions.
> > Regarding patches: yes, see my first email. I will be happy to test
> > patches and collect more debug data.
>
> The attached patch can provide debug info for each r1_bio that is
> serialized or waiting for serialization, to judge whether the problem
> is related to r1_bio dispatching or to the serialization algorithm.
>
> After the problem is reproduced, just cat /sys/block/md77/md/debug

I haven't yet tested it with that patch to collect debug info; I'll try
to do that during this week. But I managed to reproduce the problem on
another physical set of NVMe disks. As far as I understand, it won't be
possible to reproduce it on two fast disks: one of the disks has to be
slower. To achieve that I connected one of the NVMe disks via a
PCIe-to-USB adapter (raid sync speed was around 40Mb/s).

Here are the instructions to reproduce it.

0. Prepare the disks. I am using one physical nvme disk over usb and
   one loop file on a fast nvme disk.

   time dd if=/dev/zero of=file0 bs=1M count=96k status=progress
   losetup -f file0

1. Create the raid1 device with write-mostly + write-behind:

   mdadm --homehost=any --create --verbose --level=1 --metadata=1.2 \
       --raid-devices=2 /dev/md96 /dev/loop0 --bitmap=internal \
       --write-mostly --write-behind=8192 /dev/sda

2. Format it:

   mkfs.ext4 -v -m 0 -b 4096 -O fast_commit,metadata_csum \
       -E lazy_itable_init=0,lazy_journal_init=0 /dev/md96

3. Mount it somewhere:

   sudo mount /dev/md96 md96 -o noatime,user,exec

4. Right now, to reproduce it I compile the Android mainline kernel for
   Pixel 6 devices. I just copy the repository to md96, but it can also
   be synced online:
   https://source.android.com/docs/setup/build/building-pixel-kernels
   (section: "Download and Compile the Kernel")

   repo init -u https://android.googlesource.com/kernel/manifest \
       -b gs-android-gs-raviole-mainline
   repo sync -f --fail-fast --force-sync --force-remove-dirty -v -j4 -c \
       --no-tags --jobs-checkout=$(nproc)

5. The compilation commands:

   tools/bazel clean
   tools/bazel run --config=stamp //private/google-modules/soc/gs:slider_dist

You can monitor the inflight counters while this runs. Once the second
counter goes as high as 65 or larger, the freeze/hang has most likely
already happened:

watch -d -n 1 cat /sys/devices/virtual/block/md96/inflight

I think there should be other tests that reproduce it; it seems massive
I/O with a long stream of write requests is required (see the P.S. below
for an untested sketch of such a test). This can also be reproduced with
two loop file devices -- one on a fast nvme disk and the other file on a
slow nvme disk. I think that without loop devices it takes significantly
more time to repair the filesystem afterwards, and I can't confirm that
there is no file corruption after that fsck.

Hope this works.

Best regards,
Alexey
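
P.S. In case a synthetic load is useful: the fio invocation below is
only a rough, untested sketch of the kind of buffered, write-heavy
workload I have in mind -- the directory, sizes and job count are
guesses, not something I have verified to trigger the hang:

    # many jobs doing buffered 4k random writes into the md96 mount,
    # flushed by writeback and fsynced at the end of each job
    fio --name=writers --directory=md96 --numjobs=8 --size=2g \
        --rw=randwrite --bs=4k --ioengine=psync --buffered=1 \
        --end_fsync=1 --group_reporting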