On Thu, Nov 10, 2022 at 11:24 AM Zhang Tianci
<zhangtianci.1997@xxxxxxxxxxxxx> wrote:
>
> Hi Song,
>
> Thanks for your quick reply.
>
> On Thu, Nov 10, 2022 at 6:37 AM Song Liu <song@xxxxxxxxxx> wrote:
> >
> > Hi Tianci,
> >
> > Thanks for the report.
> >
> > On Tue, Nov 8, 2022 at 10:50 PM Zhang Tianci
> > <zhangtianci.1997@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi Song,
> > >
> > > I am tracking down a deadlock in Linux-5.4.56.
> > >
> > [...]
> > >
> > > $ cat /proc/mdstat
> > > Personalities : [raid6] [raid5] [raid4]
> > > md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6]
> > > nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1]
> > > nvme1n1p1[0]
> > >       15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2
> > > [9/9] [UUUUUUUUU]
> > >       [====>................]  check = 21.0% (394239024/1875240960)
> > > finish=1059475.2min speed=23K/sec
> > >       bitmap: 1/14 pages [4KB], 65536KB chunk
> >
> > How many instances of this issue do we have? If more than one, I wonder
> > whether they are all running the raid5 check (as this one is).
>
> We had three instances, but one machine has been rebooted, so we have
> two machines now.
>
> You are right that they are all running the raid5 check now. However, I
> have been debugging this problem for almost three weeks, and I remember
> they were not doing any sync when I first checked /proc/mdstat.
>
> Now the backtraces of their raid5 resync threads are all:
>
> #0 [ffffa8d15fedfb90] __schedule at ffffffffa0d93b2d
> #1 [ffffa8d15fedfc20] schedule at ffffffffa0d93eaa
> #2 [ffffa8d15fedfc38] md_bitmap_cond_end_sync at ffffffffc045758e [md_mod]
> #3 [ffffa8d15fedfc90] raid5_sync_request at ffffffffc0637e2f [raid456]
> #4 [ffffa8d15fedfcf8] md_do_sync at ffffffffc044bd1c [md_mod]
> #5 [ffffa8d15fedfe98] md_thread at ffffffffc0448e50 [md_mod]
> #6 [ffffa8d15fedff10] kthread at ffffffffa06b0d76
> #7 [ffffa8d15fedff50] ret_from_fork at ffffffffa0e001cf
>
> So I guess the resync thread is just a victim.
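
That guess matches the code: md_bitmap_cond_end_sync() just waits for all
in-flight resync I/O to complete, and it holds no bitmap counter itself.
Roughly, paraphrased from drivers/md/md-bitmap.c in 5.4 (the exact code
may differ by version):

void md_bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector,
			     bool force)
{
	...
	/*
	 * Park until every outstanding resync request has completed.
	 * If resync requests are stuck behind stripes that cannot make
	 * progress, this never returns; resync is a victim, not the
	 * cause.
	 */
	wait_event(bitmap->mddev->recovery_wait,
		   atomic_read(&bitmap->mddev->recovery_active) == 0);
	...
}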
>
> > > $ mdadm -D /dev/md10
> > > /dev/md10:
> > >            Version : 1.2
> > >      Creation Time : Fri Sep 23 11:47:03 2022
> > >         Raid Level : raid5
> > >         Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
> > >      Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
> > >       Raid Devices : 9
> > >      Total Devices : 9
> > >        Persistence : Superblock is persistent
> > >
> > >      Intent Bitmap : Internal
> > >
> > >        Update Time : Sun Nov 6 01:29:49 2022
> > >              State : active, checking
> > >     Active Devices : 9
> > >    Working Devices : 9
> > >     Failed Devices : 0
> > >      Spare Devices : 0
> > >
> > >             Layout : left-symmetric
> > >         Chunk Size : 512K
> > >
> > >       Check Status : 21% complete
> > >
> > >               Name : dc02-pd-t8-n021:10 (local to host dc02-pd-t8-n021)
> > >               UUID : 089300e1:45b54872:31a11457:a41ad66a
> > >             Events : 3968
> > >
> > >     Number   Major   Minor   RaidDevice State
> > >        0     259        8        0      active sync   /dev/nvme1n1p1
> > >        1     259        6        1      active sync   /dev/nvme2n1p1
> > >        2     259        7        2      active sync   /dev/nvme3n1p1
> > >        3     259       12        3      active sync   /dev/nvme4n1p1
> > >        4     259       11        4      active sync   /dev/nvme5n1p1
> > >        5     259       14        5      active sync   /dev/nvme6n1p1
> > >        6     259       13        6      active sync   /dev/nvme7n1p1
> > >        7     259       21        7      active sync   /dev/nvme8n1p1
> > >        9     259       20        8      active sync   /dev/nvme9n1p1
> > >
> > > And here is some internal state of the raid5, from crash and sysfs:
> > >
> > > $ cat /sys/block/md10/md/stripe_cache_active
> > > 4430      # There are a lot of active stripe_heads.
> > >
> > > crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
> > > 48        # So only 48 stripe_heads are blocked by the bitmap counter.
> > >
> > > crash > list -o stripe_head.lru -s stripe_head.state -O
> > > r5conf.delayed_list -h 0xffff90c1951d5000
> > > ....      # Many stripe_heads are listed here; the count is 4382.
> > >
> > > So of the 4430 active stripe_heads, 4382 are on the delayed_list, and
> > > the last 48 are blocked by the bitmap counter. So I guess this is the
> > > second deadlock.
> > >
> > > Then I reviewed the changelog after commit 391b5d39faea ("md/raid5:
> > > Fix Force reconstruct-write io stuck in degraded raid5"), dated
> > > 2020-07-31, and found no related fixup commit. I am not sure my
> > > understanding of raid5 is right, so I am wondering if you could help
> > > confirm whether my thoughts are correct.
> >
> > Have you tried to reproduce this with the latest kernel? There are a
> > few fixes after 2020, for example
> >
> > commit 3312e6c887fe7539f0adb5756ab9020282aaa3d4
>
> Thanks for pointing out this commit. I tried to understand it, and I
> think it is an optimization of the batch list rather than a fix, no?
>
> I have not tried to reproduce this with the latest kernel, because the
> maximum bitmap counter value (1 << 14) is so large that I think the
> problem would be difficult to reproduce. So I created a special
> environment by hacking the kernel (based on 5.4.56) to change the
> bitmap counter maximum (COUNTER_MAX) to 4. Then I could trigger a
> deadlock with this command:
>
> fio -filename=testfile -ioengine=libaio -bs=16M -size=10G -numjobs=100
> -iodepth=1 -runtime=60 -rw=write -group_reporting -name="test"
>
> That reproduced the first deadlock state, but it is not the real root
> cause.
>
> I will run a test with the latest kernel and report the result later.

I can reproduce the first deadlock on linux-6.1-rc4: there are 26 active
stripe_heads, and 26 fio threads are blocked with the same backtrace:

#0 [ffffc9000cd0f8b0] __schedule at ffffffff818b3c3c
#1 [ffffc9000cd0f940] schedule at ffffffff818b4313
#2 [ffffc9000cd0f950] md_bitmap_startwrite at ffffffffc063354a [md_mod]
#3 [ffffc9000cd0f9c0] __add_stripe_bio at ffffffffc064fbd6 [raid456]
#4 [ffffc9000cd0fa00] raid5_make_request at ffffffffc065a84c [raid456]
#5 [ffffc9000cd0fb30] md_handle_request at ffffffffc0628496 [md_mod]
#6 [ffffc9000cd0fb98] __submit_bio at ffffffff813f308f
#7 [ffffc9000cd0fbb8] submit_bio_noacct_nocheck at ffffffff813f3501
#8 [ffffc9000cd0fc00] __block_write_full_page at ffffffff8134ca64
#9 [ffffc9000cd0fc60] __writepage at ffffffff8123f4a3
#10 [ffffc9000cd0fc78] write_cache_pages at ffffffff8123fb57
#11 [ffffc9000cd0fd70] generic_writepages at ffffffff8123feef
#12 [ffffc9000cd0fdc0] do_writepages at ffffffff81241f12
#13 [ffffc9000cd0fe28] filemap_fdatawrite_wbc at ffffffff8123306b
#14 [ffffc9000cd0fe48] __filemap_fdatawrite_range at ffffffff81239154
#15 [ffffc9000cd0fec0] file_write_and_wait_range at ffffffff812393e1
#16 [ffffc9000cd0fef0] blkdev_fsync at ffffffff813ec223
#17 [ffffc9000cd0ff08] do_fsync at ffffffff81342798
#18 [ffffc9000cd0ff30] __x64_sys_fsync at ffffffff813427e0
#19 [ffffc9000cd0ff38] do_syscall_64 at ffffffff818a6114
#20 [ffffc9000cd0ff50] entry_SYSCALL_64_after_hwframe at ffffffff81a0009b

> Thanks,
> Tianci
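
For reference, the wait that those 26 fio threads are parked in is the
per-chunk counter overflow check in md_bitmap_startwrite(). Roughly,
paraphrased from drivers/md/md-bitmap.c as of v6.1 (the exact code may
differ by version):

	spin_lock_irq(&bitmap->counts.lock);
	bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1);
	...
	if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
		DEFINE_WAIT(__wait);
		/*
		 * The per-chunk write counter is saturated, so sleep
		 * until md_bitmap_endwrite() decrements it and issues
		 * the wake_up() on overflow_wait. With COUNTER_MAX
		 * hacked down to 4 as described above, this path is
		 * hit almost immediately under the fio workload.
		 */
		prepare_to_wait(&bitmap->overflow_wait, &__wait,
				TASK_UNINTERRUPTIBLE);
		spin_unlock_irq(&bitmap->counts.lock);
		schedule();
		finish_wait(&bitmap->overflow_wait, &__wait);
		continue;
	}

If every stripe_head that would eventually call md_bitmap_endwrite() for
this chunk is itself stuck (for example, sitting on the delayed_list),
nothing ever performs the wake_up(), and these writers sleep forever.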