Hi Song,

I am tracking down a deadlock in Linux-5.4.56. I end up seeing multiple hung tasks with the following stack trace:

 #0 [ffff9dccdd79f538] __schedule at ffffffffaf993b2d
 #1 [ffff9dccdd79f5c8] schedule at ffffffffaf993eaa
 #2 [ffff9dccdd79f5e0] md_bitmap_startwrite at ffffffffc08eef61 [md_mod]
 #3 [ffff9dccdd79f658] add_stripe_bio at ffffffffc0a4b627 [raid456]
 #4 [ffff9dccdd79f6a8] raid5_make_request at ffffffffc0a508fe [raid456]
 #5 [ffff9dccdd79f788] md_handle_request at ffffffffc08e4920 [md_mod]
 #6 [ffff9dccdd79f7f0] md_make_request at ffffffffc08e4a76 [md_mod]
 #7 [ffff9dccdd79f818] generic_make_request at ffffffffaf5ab4bb
 #8 [ffff9dccdd79f878] submit_bio at ffffffffaf5ab6f8
 #9 [ffff9dccdd79f8e0] ext4_io_submit at ffffffffc0834ab9 [ext4]
#10 [ffff9dccdd79f8f0] ext4_bio_write_page at ffffffffc0834d7e [ext4]

They are all blocked in md_bitmap_startwrite. After some debugging, I found two kinds of deadlock states, and I guess the latest mainline kernel has the same problems.

1. Two threads submit bios belonging to the same stripe_head:

threadA                                       threadB
raid5_make_request
  raid5_get_active_stripe(stripe count=1)
  add_stripe_bio
    md_bitmap_startwrite(get bitmap count)
  release_stripe_plug
    add this stripe to blk plug
------------------- someone does unplug:
raid5_unplug                                  raid5_make_request
                                                raid5_get_active_stripe(count=2)
  __release_stripe                              add_stripe_bio
    stripe count=1                                md_bitmap_startwrite(blocked)
  return

Because the maximum bitmap counter per chunk is (1 << 14) - 1 (see the md-bitmap excerpt below), a single stripe cannot cause this kind of deadlock on its own, but if enough stripe_heads behave like the above, there is an "AA" deadlock. Of course, this kind of deadlock is almost impossible to trigger, and the deadlock I encountered was not of this kind; but this behaviour is one component of the next kind of deadlock.

2. The stripe_head is in the same situation as above, but thread A may additionally do `atomic_inc(&conf->preread_active_stripes);`, and this count blocks raid5d from activating delayed stripes:

```
static void raid5_activate_delayed(struct r5conf *conf)
{
	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
		while (!list_empty(&conf->delayed_list)) {
			struct list_head *l = conf->delayed_list.next;
			struct stripe_head *sh;
			sh = list_entry(l, struct stripe_head, lru);
			list_del_init(l);
			clear_bit(STRIPE_DELAYED, &sh->state);
			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
				atomic_inc(&conf->preread_active_stripes);
			list_add_tail(&sh->lru, &conf->hold_list);
			raid5_wakeup_stripe_thread(sh);
		}
	}
}
```

raid5d only handles delayed stripe_heads when `conf->preread_active_stripes < IO_THRESHOLD`, where `IO_THRESHOLD` is one. So there is a kind of ABBA deadlock:

- many stripe_heads hold a bitmap count and are waiting for conf->preread_active_stripes to drop;
- some stripe_head holds conf->preread_active_stripes and is waiting for the bitmap count.

I guess the deadlock I encountered was the second kind.
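For reference, the (1 << 14) - 1 limit and the sleep these tasks are stuck in come from the per-chunk bitmap counter; this is how I read drivers/md/md-bitmap.h and md_bitmap_startwrite() in 5.4 (paraphrased from my reading, so please correct me if I have the details wrong):

```
/* md-bitmap.h: the top two bits of each 16-bit per-chunk counter are the
 * NEEDED and RESYNC flags, so the usable write counter tops out at
 * COUNTER_MAX = (1 << 14) - 1. */
#define COUNTER_BITS 16
#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1)))
#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2)))
#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1)

/* md-bitmap.c, md_bitmap_startwrite(): when a chunk's counter is already at
 * COUNTER_MAX, the writer sleeps until an md_bitmap_endwrite() on the same
 * chunk wakes overflow_wait.  This is where all the hung tasks above sit. */
		if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
			DEFINE_WAIT(__wait);
			prepare_to_wait(&bitmap->overflow_wait, &__wait,
					TASK_UNINTERRUPTIBLE);
			spin_unlock_irq(&bitmap->counts.lock);
			schedule();
			finish_wait(&bitmap->overflow_wait, &__wait);
			continue;
		}
```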
Here is some information about the raid5 array I am using:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md10 : active raid5 nvme9n1p1[9] nvme8n1p1[7] nvme7n1p1[6] nvme6n1p1[5] nvme5n1p1[4] nvme4n1p1[3] nvme3n1p1[2] nvme2n1p1[1] nvme1n1p1[0]
      15001927680 blocks super 1.2 level 5, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [====>................]  check = 21.0% (394239024/1875240960) finish=1059475.2min speed=23K/sec
      bitmap: 1/14 pages [4KB], 65536KB chunk

$ mdadm -D /dev/md10
/dev/md10:
           Version : 1.2
     Creation Time : Fri Sep 23 11:47:03 2022
        Raid Level : raid5
        Array Size : 15001927680 (14306.95 GiB 15361.97 GB)
     Used Dev Size : 1875240960 (1788.37 GiB 1920.25 GB)
      Raid Devices : 9
     Total Devices : 9
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Nov 6 01:29:49 2022
             State : active, checking
    Active Devices : 9
   Working Devices : 9
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

      Check Status : 21% complete

              Name : dc02-pd-t8-n021:10  (local to host dc02-pd-t8-n021)
              UUID : 089300e1:45b54872:31a11457:a41ad66a
            Events : 3968

    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme1n1p1
       1     259        6        1      active sync   /dev/nvme2n1p1
       2     259        7        2      active sync   /dev/nvme3n1p1
       3     259       12        3      active sync   /dev/nvme4n1p1
       4     259       11        4      active sync   /dev/nvme5n1p1
       5     259       14        5      active sync   /dev/nvme6n1p1
       6     259       13        6      active sync   /dev/nvme7n1p1
       7     259       21        7      active sync   /dev/nvme8n1p1
       9     259       20        8      active sync   /dev/nvme9n1p1

And some internal state of the raid5 array, from sysfs and crash:

$ cat /sys/block/md10/md/stripe_cache_active
4430
# There are many active stripe_heads.

crash > foreach UN bt | grep md_bitmap_startwrite | wc -l
48
# Only 48 stripe_heads are blocked on the bitmap counter.

crash > list -o stripe_head.lru -s stripe_head.state -O r5conf.delayed_list -h 0xffff90c1951d5000
....
# There are many stripe_heads on delayed_list; the number is 4382.

There are 4430 active stripe_heads: 4382 of them are on delayed_list, and the remaining 48 are blocked on the bitmap counter. So I guess this is the second kind of deadlock.

Then I reviewed the changelog after commit 391b5d39faea ("md/raid5: Fix Force reconstruct-write io stuck in degraded raid5"), dated 2020-07-31, and found no related fix.

I'm not sure my understanding of raid5 is right, so I'm wondering if you could help confirm whether my analysis is correct.

If it is, I have one idea to fix the first problem: add a field `bit_counter` to struct stripe_head that counts this stripe_head's writes; only when it goes from 0 to 1 do we call md_bitmap_startwrite, and only when it goes from 1 to 0 do we call md_bitmap_endwrite, so that one stripe_head takes at most one bitmap count. A rough sketch of what I mean is below my signature.

But for the second problem, I don't think I really understand the meaning of r5conf->preread_active_stripes, so I have no good idea how to fix it.

Thanks,
Tianci
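P.S. To make the bit_counter idea for the first problem more concrete, here is a rough, untested sketch (the field name and the exact call sites are only my guess at how it could look, not a real patch):

```
/* drivers/md/raid5.h: hypothetical per-stripe count of writes that have
 * been accounted to the md bitmap; would be initialized to 0 when the
 * stripe_head is set up. */
struct stripe_head {
	...
	atomic_t		bit_counter;
	...
};

/* drivers/md/raid5.c, add_stripe_bio(): take a bitmap count only on the
 * 0 -> 1 transition, so one stripe_head pins at most one bitmap count. */
	if (conf->mddev->bitmap && firstwrite &&
	    atomic_inc_return(&sh->bit_counter) == 1)
		md_bitmap_startwrite(conf->mddev->bitmap, sh->sector,
				     STRIPE_SECTORS, 0);

/* ...and gate the existing md_bitmap_endwrite() call sites for this stripe
 * the same way, keeping their current arguments, so the bitmap count is
 * dropped only on the 1 -> 0 transition. */
	if (atomic_dec_and_test(&sh->bit_counter))
		md_bitmap_endwrite(...);	/* existing arguments unchanged */
```

I have not thought through how this interacts with stripe batching (STRIPE_BITMAP_PENDING / sh->bm_seq), so this is only meant to show the direction.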