On Fri, Nov 11, 2022 at 2:26 PM Zhang Tianci
<zhangtianci.1997@xxxxxxxxxxxxx> wrote:
>
> On Fri, Nov 11, 2022 at 12:14 PM Song Liu <song@xxxxxxxxxx> wrote:
> >
> > On Thu, Nov 10, 2022 at 10:25 AM Zhang Tianci
> > <zhangtianci.1997@xxxxxxxxxxxxx> wrote:
> > >
> > > > fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100 \
> > > >     -iodepth=1 -runtime=60 -rw=write -group_reporting -name="test"
> > > >
> > > > Then I found the first deadlock state, but it is not the real cause.
> > > >
> > > > I will do a test with the latest kernel and report the result to you later.
> > > >
> > > I can reproduce the first deadlock on linux-6.1-rc4.
> > > There are 26 stripe_heads and 26 fio threads blocked with the same backtrace:
> > >
> > >  #0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
> > >  #1 [ffffc9000cd0f940] schedule at ffffffff818b4313
> > >  #2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
> > >  #3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
> > >  #4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
> > >  #5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
> > >  #6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
> > >  #7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
> > >  #8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
> > >  #9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
> > > #10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
> > > #11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
> > > #12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
> > > #13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
> > > #14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
> > > #15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
> > > #16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
> > > #17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
> > > #18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
> > > #19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
> > > #20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b
> >
> > Thanks for this information.
> >
> > I guess this is with COUNTER_MAX of 4? And it is slightly different
> > from the issue you found?
>
> Yes, I hacked COUNTER_MAX down to 4; I think this increases the
> probability of the bitmap counter racing.
>
> This kind of deadlock is very hard to hit without the hack. It only
> happened while I was debugging, but it helped me form a guess (the
> second deadlock state in the first email) about the real cause.
>
> >
> > I will try to look into this next week (taking some time off this week).
>
> Thanks,
> Tianci

Hi Song,

We have hit this problem again on a new machine, and /proc/mdstat shows
the array is not resyncing:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6] nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1] nvme1n1p1[0]
      15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 6/14 pages [24KB], 65536KB chunk

unused devices: <none>

And more than 15,000 stripe_heads are blocked:

# cat /sys/block/md10/md/stripe_cache_active
15456

I guess this is the same problem as before. What do you think about my
deadlock guess in the first email?

Thanks,
Tianci
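
P.S. For anyone trying to reproduce this: the COUNTER_MAX hack mentioned
above is just a one-line change to the per-chunk write-counter limit in
drivers/md/md-bitmap.h, so that md_bitmap_startwrite() reaches its
"counter == COUNTER_MAX" wait path after only a few in-flight writes per
bitmap chunk. A minimal sketch of the change follows; the exact upstream
definition of the macro may differ between kernel versions, so treat the
removed line as illustrative:

/* drivers/md/md-bitmap.h -- debugging hack only, NOT for production */
-#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)
+#define COUNTER_MAX ((bitmap_counter_t) 4)

With the limit at 4 instead of ~16k, a writer that already holds a
stripe_head (as in frames #2-#4 of the backtrace above) sleeps in
md_bitmap_startwrite() much sooner, which makes the racy window easy to
hit under the quoted fio workload.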