On Thu, 28 Dec 2023 at 12:09, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2023/12/28 12:19, Alexey Klimov wrote:
> > On Thu, 14 Dec 2023 at 01:34, Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi,
> >>
> >> On 2023/12/14 7:48, Alexey Klimov wrote:
> >>> Hi all,
> >>>
> >>> After assembling a raid1 consisting of two NVMe disks/partitions, where
> >>> one of the NVMes is slower than the other, using the following command:
> >>> mdadm --homehost=any --create --verbose --level=1 --metadata=1.2
> >>> --raid-devices=2 /dev/md77 /dev/nvme2n1p9 --bitmap=internal
> >>> --write-mostly --write-behind=8192 /dev/nvme1n1p2
> >>>
> >>> I noticed some I/O freezing/lockup issues when doing distro builds
> >>> using yocto. The idea of building a write-mostly raid1 came from URL
> >>> [0]. I suspected that massive and long IO operations led to that, and
> >>> while trying to narrow it down I can see that it doesn't survive
> >>> rebuilding the linux kernel (just a simple make -j33).
> >>>
> >>> After enabling some lock checks and the lockup detectors in the kernel,
> >>> I think this is the main blocked task message:
> >>>
> >>> [ 984.138650] INFO: task kworker/u65:5:288 blocked for more than 491 seconds.
> >>> [ 984.138682] Not tainted 6.7.0-rc5-00047-g5bd7ef53ffe5 #1
> >>> [ 984.138694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> [ 984.138702] task:kworker/u65:5 state:D stack:0 pid:288 tgid:288 ppid:2 flags:0x00004000
> >>> [ 984.138728] Workqueue: writeback wb_workfn (flush-9:77)
> >>> [ 984.138760] Call Trace:
> >>> [ 984.138770] <TASK>
> >>> [ 984.138785] __schedule+0x3a5/0x1600
> >>> [ 984.138807] ? schedule+0x99/0x120
> >>> [ 984.138818] ? find_held_lock+0x2b/0x80
> >>> [ 984.138840] schedule+0x48/0x120
> >>> [ 984.138851] ? schedule+0x99/0x120
> >>> [ 984.138861] wait_for_serialization+0xd2/0x110
> >>
> >> This is waiting for issued IO to be done, from:
> >>   raid1_end_write_request
> >>     remove_serial
> >>       raid1_rb_remove
> >>         wake_up
> >
> > Yep, looks like this.
> >
> >> So the first thing that needs clarification is whether there is unfinished
> >> IO on the underlying disks. This is not easy, but perhaps you can try:
> >>
> >> 1) make sure nothing else is using the underlying disks;
> >> 2) reproduce the problem, and then collect debugfs info for the underlying
> >> disks with the following cmd:
> >>
> >> find /sys/kernel/debug/block/sda/ -type f | xargs grep .
> >
> > I collected this and am attaching it to this email.
> > When I collected this debug data I also noticed the following inflight counters:
> > root@tux:/sys/devices/virtual/block/md77# cat inflight
> >        0       65
> > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme1n1p2/inflight
> >        0        0
> > root@tux:/sys/devices/virtual/block/md77# cat slaves/nvme2n1p9/inflight
> >        0        0
> >
> > So I guess on the md or raid1 level there are 65 write requests that
> > didn't finish, but nothing from the underlying physical devices, right?
>
> Actually, IOs stuck in wait_for_serialization() are accounted in the
> inflight counters of md77.

Do you think ftracing the mdXY_raid1 kernel thread with the function
plugin will help? (A rough sketch of what I have in mind is below.)

> However, since there is no IO from the nvme disks, it looks like something
> is wrong with write behind.
>
> >
> > Apart from that: when the lockup/freeze happens I can mount other
> > partitions on the corresponding nvme devices and create files there.
> > The nvme-cli util also doesn't show any issues AFAICS.
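
To expand on the ftrace question above, this is roughly what I had in mind.
The thread name md77_raid1 and the filter pattern are guesses from my setup,
so please tell me if a different set of functions would be more useful:

cd /sys/kernel/tracing
echo 0 > tracing_on
echo function > current_tracer
echo 'raid1_*' > set_ftrace_filter     # optional, or leave the filter empty
pgrep md77_raid1 > set_ftrace_pid      # trace only the raid1 kthread
echo 1 > tracing_on
# ... reproduce the hang ...
echo 0 > tracing_on
cat trace > /tmp/md77_raid1.ftrace

or the trace-cmd equivalent:

trace-cmd record -p function -P "$(pgrep md77_raid1)"

I'm not sure yet whether plain function tracing will show where the writes
get stuck, but it should at least tell whether the raid1 thread is still
making progress.
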
> >
> > When I manually set the backlog file to zero:
> > echo 0 > /sys/devices/virtual/block/md77/md/bitmap/backlog
> > the lockup is no longer reproducible.
>
> Of course, this disables write behind, which also indicates that something
> is wrong with write behind.

Yes, in some sense that is the main topic of my emails. I am just trying
to get you more debugging data.

> I'm trying to review the related code, however, this might be difficult.
> Can you describe in detailed steps how you reproduce this problem, and I
> will try to reproduce it? If I still can't, can you apply a debug patch
> and recompile the kernel to test?

I am trying to reproduce this using files and loop devices and I'll provide
instructions; a rough sketch of what I'm trying so far is in the P.S. below.

Regarding patches: yes, see my first email. I will be happy to test patches
and collect more debug data.

Thanks,
Alexey
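
P.S. In case it helps in the meantime, this is roughly the loop-device setup
I'm experimenting with. The sizes, paths and the write workload are just my
current guesses; I haven't confirmed yet that this combination triggers the
hang:

truncate -s 4G /tmp/fast.img
truncate -s 4G /tmp/slow.img
FAST=$(losetup --find --show /tmp/fast.img)
SLOW=$(losetup --find --show /tmp/slow.img)
# same layout as the original array: the second leg is write-mostly
# with write-behind enabled
mdadm --create /dev/md77 --level=1 --metadata=1.2 --raid-devices=2 \
    --bitmap=internal "$FAST" --write-mostly --write-behind=8192 "$SLOW"
mkfs.ext4 /dev/md77
mount /dev/md77 /mnt
# then generate a lot of buffered write traffic, e.g. a kernel build
# in /mnt, or something like:
fio --name=writebehind --directory=/mnt --rw=randwrite --bs=4k --size=2G \
    --numjobs=8 --time_based --runtime=600

Since both loop devices sit on the same disk here, one leg is not really
slower than the other; if that turns out to matter I'll try putting a
dm-delay target on top of the write-mostly loop device to emulate the
slow NVMe.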