On Tue, 2024-01-02 at 10:54 +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/12/30 9:27, Alexey Klimov wrote:
> > On Thu, 28 Dec 2023 at 12:09, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On 2023/12/28 12:19, Alexey Klimov wrote:
> > > > On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > On 2023/12/14 7:48, Alexey Klimov wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > After assembling a raid1 consisting of two NVMe disks/partitions,
> > > > > > where one of the NVMes is slower than the other, using this command:
> > > > > >
> > > > > > mdadm --homehost=any --create --verbose --level=1 --metadata=1.2 \
> > > > > >     --raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal \
> > > > > >     --write-mostly --write-behind=8192 /dev/nvme1n1p2
> > > > > >
> > > > > > I noticed some I/O freezing/lockup issues when doing distro builds
> > > > > > using yocto. The idea of building a write-mostly raid1 came from URL
> > > > > > [0]. I suspected that massive and long I/O operations led to this,
> > > > > > and while trying to narrow it down I can see that it doesn't survive
> > > > > > rebuilding the Linux kernel (just a simple make -j33).
> > > > > >
> > > > > > After enabling some lock checks and lockup detectors in the kernel,
> > > > > > I think this is the main blocked-task message:
> > > > > >
> > > > > > [  984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
> > > > > > [  984.138682]       Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
> > > > > > [  984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > [  984.138702] task:kworker/u65:5   state:D stack:0   pid:288   tgid:288   ppid:2   flags:0x00004000
> > > > > > [  984.138728] Workqueue: writeback wb_workfn (flush-9:77)
> > > > > > [  984.138760] Call Trace:
> > > > > > [  984.138770]  <TASK>
> > > > > > [  984.138785]  __schedule+0x3a5/0x1600
> > > > > > [  984.138807]  ? schedule+0x99/0x120
> > > > > > [  984.138818]  ? find_held_lock+0x2b/0x80
> > > > > > [  984.138840]  schedule+0x48/0x120
> > > > > > [  984.138851]  ? schedule+0x99/0x120
> > > > > > [  984.138861]  wait_for_serialization+0xd2/0x110
> > > > >
> > > > > This is waiting for issued IO to be done, from
> > > > > raid1_end_write_request
> > > > >   remove_serial
> > > > >     raid1_rb_remove
> > > > >     wake_up
> > > >
> > > > Yep, looks like this.
> > > >
> > > > > So the first thing that needs clarification is whether there is
> > > > > unfinished IO from the underlying disks. This is not easy, but
> > > > > perhaps you can try:
> > > > >
> > > > > 1) don't let anyone else use the underlying disks;
> > > > > 2) reproduce the problem, and then collect debugfs info for the
> > > > > underlying disks with the following cmd:
> > > > >
> > > > > find /sys/kernel/debug/block/sda/ -type f | xargs grep .
> > > >
> > > > I collected this and am attaching it to this email.
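
(For reference: to collect the same data on another setup, something
along these lines should work. This is only a sketch -- the device and
array names are from my setup, and the output file names are arbitrary:

    # dump blk-mq debugfs state for the two member disks (needs root)
    for d in nvme1n1 nvme2n1; do
        find /sys/kernel/debug/block/$d/ -type f | xargs grep . > blk-debugfs-$d.txt
    done
    # snapshot in-flight counters for the array and its members
    grep . /sys/block/md77/inflight /sys/block/md77/slaves/*/inflight
)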
> > > > When I collected this debug data I also noticed the following
> > > > inflight counters:
> > > >
> > > > root@tux:/sys/devices/virtual/block/md77# cat inflight
> > > >        0       65
> > > > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
> > > >        0        0
> > > > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
> > > >        0        0
> > > >
> > > > So I guess at the md/raid1 level there are 65 write requests that
> > > > didn't finish, but nothing from the underlying physical devices,
> > > > right?
> > >
> > > Actually, IOs stuck in wait_for_serialization() are accounted by
> > > inflight from md77.
> >
> > Do you think ftracing the mdXY_raid1 kernel thread with the function
> > plugin will help?
> >
> > > However, since there is no IO from the nvme disks, it looks like
> > > something is wrong with write-behind.
> > >
> > > > Apart from that: when the lockup/freeze happens I can mount other
> > > > partitions on the corresponding nvme devices and create files
> > > > there. The nvme-cli util also doesn't show any issues, AFAICS.
> > > >
> > > > When I manually set the backlog file to zero:
> > > >
> > > > echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
> > > >
> > > > the lockup is no longer reproducible.
> > >
> > > Of course, this disables write-behind, which also indicates that
> > > something is wrong with write-behind.
> >
> > Yes, in some sense this is the main topic of my emails. I am just
> > trying to get you more debugging data.
> >
> > > I'm trying to review the related code; however, this might be
> > > difficult. Can you describe how you reproduce this problem in
> > > detailed steps, and I will try to reproduce it? If I still can't,
> > > can you apply a debug patch and recompile the kernel to test?
> >
> > I am trying to reproduce this using files and loop devices and I'll
> > provide instructions.
> > Regarding patches: yes, see my first email. I will be happy to test
> > patches and collect more debug data.
>
> The attached patch can provide debug info for each r1_bio that is
> serialized or waiting for serialization, to judge whether the problem
> is related to r1_bio dispatching or to the serialization algorithm.
>
> After the problem is reproduced, just cat /sys/block/md77/md/debug

I haven't yet tested it with that patch to collect debug info; I'll try
to do that during this week. But I managed to reproduce the problem on
another physical set of NVMe disks. As far as I understand, it won't be
possible to reproduce it on two fast disks: one of the disks has to be
slower. To achieve that I connected one of the NVMe disks via a
PCIe-to-USB adapter (raid sync speed was around 40Mb/s).

Here are the instructions to reproduce it.

0. Prepare the disks. I am using one physical nvme disk over usb and
   one loop file on a fast nvme disk.

   time dd if=/dev/zero of=file0 bs=1M count=96k status=progress
   losetup -f file0

1. Create the raid1 device with write-mostly + write-behind:

   mdadm --homehost=any --create --verbose --level=1 --metadata=1.2 \
       --raid-devices=2 /dev/md96 /dev/loop0 --bitmap=internal \
       --write-mostly --write-behind=8192 /dev/sda

2. Format it:

   mkfs.ext4 -v -m 0 -b 4096 -O fast_commit,metadata_csum \
       -E lazy_itable_init=0,lazy_journal_init=0 /dev/md96

3. Mount it somewhere:

   sudo mount /dev/md96 md96 -o noatime,user,exec

4. Right now, to reproduce it I compile the Android mainline kernel for
   Pixel 6 devices. I just copy the repository to md96, but it can also
   be synced online:
   https://source.android.com/docs/setup/build/building-pixel-kernels
   (section: "Download and Compile the Kernel")

   repo init -u https://android.googlesource.com/kernel/manifest \
       -b gs-android-gs-raviole-mainline
   repo sync -f --fail-fast --force-sync --force-remove-dirty -v -j4 -c \
       --no-tags --jobs-checkout=$(nproc)

5. The compilation commands:

   tools/bazel clean
   tools/bazel run --config=stamp //private/google-modules/soc/gs:slider_dist

You can monitor the inflight counters while this runs. Once the second
counter goes as high as 65 or larger, the freeze/hang has most likely
already happened:

watch -d -n 1 cat /sys/devices/virtual/block/md96/inflight

I think there should be other tests that reproduce it; it seems massive
I/O with a long stream of write requests is required (see the P.S. below
for an untested sketch of such a test). This can also be reproduced with
two loop file devices -- one on a fast nvme disk and the other file on a
slow nvme disk. I think that without loop devices it takes significantly
more time to repair the filesystem afterwards, and I can't confirm that
there is no file corruption after that fsck.

Hope this works.

Best regards,
Alexey
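
P.S. In case a synthetic load is useful: the fio invocation below is
only a rough, untested sketch of the kind of buffered, write-heavy
workload I have in mind -- the directory, sizes and job count are
guesses, not something I have verified to trigger the hang:

    # many jobs doing buffered 4k random writes into the md96 mount,
    # flushed by writeback and fsynced at the end of each job
    fio --name=writers --directory=md96 --numjobs=8 --size=2g \
        --rw=randwrite --bs=4k --ioengine=psync --buffered=1 \
        --end_fsync=1 --group_reporting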