On 25.10.2024 02:37, Ming Lei wrote:
> Recently we got several deadlock report[1][2][3] caused by
> blk_mq_freeze_queue and blk_enter_queue().
>
> Turns out the two are just like acquiring read/write lock, so model them
> as read/write lock for supporting lockdep:
>
> 1) model q->q_usage_counter as two locks(io and queue lock)
>
> - queue lock covers sync with blk_enter_queue()
>
> - io lock covers sync with bio_enter_queue()
>
> 2) make the lockdep class/key as per-queue:
>
> - different subsystem has very different lock use pattern, shared lock
> class causes false positive easily
>
> - freeze_queue degrades to no lock in case that disk state becomes DEAD
> because bio_enter_queue() won't be blocked any more
>
> - freeze_queue degrades to no lock in case that request queue becomes dying
> because blk_enter_queue() won't be blocked any more
>
> 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
> - it is exclusive lock, so dependency with blk_enter_queue() is covered
>
> - it is trylock because blk_mq_freeze_queue() are allowed to run
> concurrently
>
> 4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
> - nested blk_enter_queue() are allowed
>
> - dependency with blk_mq_freeze_queue() is covered
>
> - blk_queue_exit() is often called from other contexts(such as irq), and
> it can't be annotated as lock_release(), so simply do it in
> blk_enter_queue(), this way still covered cases as many as possible
>
> With lockdep support, such kind of reports may be reported asap and
> needn't wait until the real deadlock is triggered.
>
> For example, lockdep report can be triggered in the report[3] with this
> patch applied.
>
> [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
> https://bugzilla.kernel.org/show_bug.cgi?id=219166
>
> [2] del_gendisk() vs blk_queue_enter() race condition
> https://lore.kernel.org/linux-block/20241003085610.GK11458@xxxxxxxxxx/
>
> [3] queue_freeze & queue_enter deadlock in scsi
> https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
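
To make the model above easier to follow, here is a minimal sketch of
how it maps onto lockdep's rwsem annotation helpers. The struct and
function names below are illustrative only, and the per-queue class/key
setup from point 2 is left out; this is not the actual patch code:

#include <linux/lockdep.h>

/* Sketch: q->q_usage_counter modeled as two lockdep maps. */
struct sketch_queue {
	struct lockdep_map io_lockdep_map;	/* "io lock": syncs with bio_enter_queue() */
	struct lockdep_map q_lockdep_map;	/* "queue lock": syncs with blk_enter_queue() */
};

static inline void sketch_freeze_queue(struct sketch_queue *q)
{
	/*
	 * Exclusive acquire with trylock = 1 (third argument), because
	 * concurrent blk_mq_freeze_queue() calls are allowed.
	 */
	rwsem_acquire(&q->io_lockdep_map, 0, 1, _RET_IP_);
	rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);
}

static inline void sketch_enter_queue(struct sketch_queue *q)
{
	/*
	 * Read-side acquire, released immediately: blk_queue_exit() may
	 * run from irq context, where a lock_release() annotation is not
	 * feasible, so the acquire/release pair is done on entry.
	 */
	rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
	rwsem_release(&q->q_lockdep_map, _RET_IP_);
}
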
This patch landed yesterday in linux-next as commit f1be1788a32e
("block: model freeze & enter queue as lock for supporting lockdep").
In my tests I found that it introduces the following 2 lockdep warnings:

1. On Samsung Exynos 4412-based Odroid U3 board (ARM 32bit), observed
when booting it:

======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc4-00037-gf1be1788a32e #9290 Not tainted
------------------------------------------------------
find/1284 is trying to acquire lock:
cf3b8534 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x30/0x70

but task is already holding lock:
c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at: iterate_dir+0x30/0x140

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&sb->s_type->i_mutex_key#2){++++}-{3:3}:
       down_write+0x44/0xc4
       start_creating+0x8c/0x170
       debugfs_create_dir+0x1c/0x178
       blk_register_queue+0xa0/0x1c0
       add_disk_fwnode+0x210/0x434
       brd_alloc+0x1cc/0x210
       brd_init+0xac/0x104
       do_one_initcall+0x64/0x30c
       kernel_init_freeable+0x1c4/0x228
       kernel_init+0x1c/0x12c
       ret_from_fork+0x14/0x28

-> #3 (&q->debugfs_mutex){+.+.}-{3:3}:
       __mutex_lock+0x94/0x94c
       mutex_lock_nested+0x1c/0x24
       blk_mq_init_sched+0x140/0x204
       elevator_init_mq+0xb8/0x130
       add_disk_fwnode+0x3c/0x434
       mmc_blk_alloc_req+0x34c/0x464
       mmc_blk_probe+0x1f4/0x6f8
       really_probe+0xe0/0x3d8
       __driver_probe_device+0x9c/0x1e4
       driver_probe_device+0x30/0xc0
       __device_attach_driver+0xa8/0x120
       bus_for_each_drv+0x80/0xcc
       __device_attach+0xac/0x1fc
       bus_probe_device+0x8c/0x90
       device_add+0x5a4/0x7cc
       mmc_add_card+0x118/0x2c8
       mmc_attach_mmc+0xd8/0x174
       mmc_rescan+0x2ec/0x3a8
       process_one_work+0x240/0x6d0
       worker_thread+0x1a0/0x398
       kthread+0x104/0x138
       ret_from_fork+0x14/0x28

-> #2 (&q->q_usage_counter(io)#17){++++}-{0:0}:
       blk_mq_submit_bio+0x8dc/0xb34
       __submit_bio+0x78/0x148
       submit_bio_noacct_nocheck+0x204/0x32c
       ext4_mpage_readpages+0x558/0x7b0
       read_pages+0x64/0x28c
       page_cache_ra_unbounded+0x118/0x1bc
       filemap_get_pages+0x124/0x7ec
       filemap_read+0x174/0x5b0
       __kernel_read+0x128/0x2c0
       bprm_execve+0x230/0x7a4
       kernel_execve+0xec/0x194
       try_to_run_init_process+0xc/0x38
       kernel_init+0xdc/0x12c
       ret_from_fork+0x14/0x28

-> #1 (mapping.invalidate_lock#2){++++}-{3:3}:
       down_read+0x44/0x224
       filemap_fault+0x648/0x10f0
       __do_fault+0x38/0xd4
       handle_mm_fault+0xaf8/0x14d0
       do_page_fault+0xe0/0x5c8
       do_DataAbort+0x3c/0xb0
       __dabt_svc+0x50/0x80
       __clear_user_std+0x34/0x68
       elf_load+0x1a8/0x208
       load_elf_binary+0x3f4/0x13cc
       bprm_execve+0x28c/0x7a4
       do_execveat_common+0x150/0x198
       sys_execve+0x30/0x38
       ret_fast_syscall+0x0/0x1c

-> #0 (&mm->mmap_lock){++++}-{3:3}:
       __lock_acquire+0x158c/0x2970
       lock_acquire+0x130/0x384
       __might_fault+0x50/0x70
       filldir64+0x94/0x28c
       dcache_readdir+0x174/0x260
       iterate_dir+0x64/0x140
       sys_getdents64+0x60/0x130
       ret_fast_syscall+0x0/0x1c

other info that might help us debug this:

Chain exists of:
  &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#2

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rlock(&sb->s_type->i_mutex_key#2);
                               lock(&q->debugfs_mutex);
                               lock(&sb->s_type->i_mutex_key#2);
  rlock(&mm->mmap_lock);

 *** DEADLOCK ***

2 locks held by find/1284:
 #0: c3df1e88 (&f->f_pos_lock){+.+.}-{3:3}, at: fdget_pos+0x88/0xd0
 #1: c203a0c8 (&sb->s_type->i_mutex_key#2){++++}-{3:3}, at: iterate_dir+0x30/0x140

stack backtrace:
CPU: 1 UID: 0 PID: 1284 Comm: find Not tainted 6.12.0-rc4-00037-gf1be1788a32e #9290
Hardware name: Samsung Exynos (Flattened Device Tree)
Call trace:
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x68/0x88
 dump_stack_lvl from print_circular_bug+0x31c/0x394
 print_circular_bug from check_noncircular+0x16c/0x184
 check_noncircular from __lock_acquire+0x158c/0x2970
 __lock_acquire from lock_acquire+0x130/0x384
 lock_acquire from __might_fault+0x50/0x70
 __might_fault from filldir64+0x94/0x28c
 filldir64 from dcache_readdir+0x174/0x260
 dcache_readdir from iterate_dir+0x64/0x140
 iterate_dir from sys_getdents64+0x60/0x130
 sys_getdents64 from ret_fast_syscall+0x0/0x1c
Exception stack(0xf22b5fa8 to 0xf22b5ff0)
5fa0:                   004b4fa0 004b4f80 00000004 004b4fa0 00008000 00000000
5fc0: 004b4fa0 004b4f80 00000001 000000d9 00000000 004b4af0 00000000 000010ea
5fe0: 004b1eb4 bea05af0 b6da4b08 b6da4a28
--->8---
2. On QEMU's ARM64 virt machine, observed during a system suspend/resume
cycle:

# time rtcwake -s10 -mmem
rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Oct 29 11:54:30 2024
PM: suspend entry (s2idle)
Filesystems sync: 0.004 seconds
Freezing user space processes
Freezing user space processes completed (elapsed 0.007 seconds)
OOM killer disabled.
Freezing remaining freezable tasks
Freezing remaining freezable tasks completed (elapsed 0.004 seconds)

======================================================
WARNING: possible circular locking dependency detected
6.12.0-rc4+ #9291 Not tainted
------------------------------------------------------
rtcwake/1299 is trying to acquire lock:
ffff80008358a7f8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x1c/0x28

but task is already holding lock:
ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: virtblk_freeze+0x24/0x60

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&q->q_usage_counter(io)#5){++++}-{0:0}:
       blk_mq_submit_bio+0x7c0/0x9d8
       __submit_bio+0x74/0x164
       submit_bio_noacct_nocheck+0x2d4/0x3b4
       submit_bio_noacct+0x148/0x3fc
       submit_bio+0x130/0x204
       submit_bh_wbc+0x148/0x1bc
       submit_bh+0x18/0x24
       ext4_read_bh_nowait+0x70/0x100
       ext4_sb_breadahead_unmovable+0x50/0x80
       __ext4_get_inode_loc+0x354/0x654
       ext4_get_inode_loc+0x40/0xa8
       ext4_reserve_inode_write+0x40/0xf0
       __ext4_mark_inode_dirty+0x88/0x300
       ext4_ext_tree_init+0x40/0x4c
       __ext4_new_inode+0x99c/0x1614
       ext4_create+0xe4/0x1d4
       lookup_open.isra.0+0x47c/0x540
       path_openat+0x370/0x9e8
       do_filp_open+0x80/0x130
       do_sys_openat2+0xb4/0xe8
       __arm64_compat_sys_openat+0x5c/0xa4
       invoke_syscall+0x48/0x110
       el0_svc_common.constprop.0+0x40/0xe8
       do_el0_svc_compat+0x20/0x3c
       el0_svc_compat+0x44/0xe0
       el0t_32_sync_handler+0x98/0x148
       el0t_32_sync+0x194/0x198

-> #2 (jbd2_handle){++++}-{0:0}:
       start_this_handle+0x178/0x4e8
       jbd2__journal_start+0x110/0x298
       __ext4_journal_start_sb+0x9c/0x274
       ext4_dirty_inode+0x38/0x88
       __mark_inode_dirty+0x90/0x568
       generic_update_time+0x50/0x64
       touch_atime+0x2c0/0x324
       ext4_file_mmap+0x68/0x88
       mmap_region+0x448/0xa38
       do_mmap+0x3dc/0x540
       vm_mmap_pgoff+0xf8/0x1b4
       ksys_mmap_pgoff+0x148/0x1f0
       __arm64_compat_sys_aarch32_mmap2+0x20/0x2c
       invoke_syscall+0x48/0x110
       el0_svc_common.constprop.0+0x40/0xe8
       do_el0_svc_compat+0x20/0x3c
       el0_svc_compat+0x44/0xe0
       el0t_32_sync_handler+0x98/0x148
       el0t_32_sync+0x194/0x198

-> #1 (&mm->mmap_lock){++++}-{3:3}:
       __might_fault+0x5c/0x80
       inet_gifconf+0xcc/0x278
       dev_ifconf+0xc4/0x1f8
       sock_ioctl+0x234/0x384
       compat_sock_ioctl+0x180/0x35c
       __arm64_compat_sys_ioctl+0x14c/0x16c
       invoke_syscall+0x48/0x110
       el0_svc_common.constprop.0+0x40/0xe8
       do_el0_svc_compat+0x20/0x3c
       el0_svc_compat+0x44/0xe0
       el0t_32_sync_handler+0x98/0x148
       el0t_32_sync+0x194/0x198

-> #0 (rtnl_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1374/0x2224
       lock_acquire+0x200/0x340
       __mutex_lock+0x98/0x428
       mutex_lock_nested+0x24/0x30
       rtnl_lock+0x1c/0x28
       virtnet_freeze_down.isra.0+0x20/0x9c
       virtnet_freeze+0x44/0x60
       virtio_device_freeze+0x68/0x94
       virtio_mmio_freeze+0x14/0x20
       platform_pm_suspend+0x2c/0x6c
       dpm_run_callback+0xa4/0x218
       device_suspend+0x12c/0x52c
       dpm_suspend+0x10c/0x2e4
       dpm_suspend_start+0x70/0x78
       suspend_devices_and_enter+0x130/0x798
       pm_suspend+0x1ac/0x2e8
       state_store+0x8c/0x110
       kobj_attr_store+0x18/0x2c
       sysfs_kf_write+0x4c/0x78
       kernfs_fop_write_iter+0x120/0x1b4
       vfs_write+0x2b0/0x35c
       ksys_write+0x68/0xf4
       __arm64_sys_write+0x1c/0x28
       invoke_syscall+0x48/0x110
       el0_svc_common.constprop.0+0x40/0xe8
       do_el0_svc_compat+0x20/0x3c
       el0_svc_compat+0x44/0xe0
       el0t_32_sync_handler+0x98/0x148
       el0t_32_sync+0x194/0x198

other info that might help us debug this:

Chain exists of:
  rtnl_mutex --> jbd2_handle --> &q->q_usage_counter(io)#5

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&q->q_usage_counter(io)#5);
                               lock(jbd2_handle);
                               lock(&q->q_usage_counter(io)#5);
  lock(rtnl_mutex);

 *** DEADLOCK ***

9 locks held by rtcwake/1299:
 #0: ffff0000069103f8 (sb_writers#7){.+.+}-{0:0}, at: vfs_write+0x1e4/0x35c
 #1: ffff000007c0ae88 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf0/0x1b4
 #2: ffff0000046906e8 (kn->active#24){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf8/0x1b4
 #3: ffff800082fd9908 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend+0x88/0x2e8
 #4: ffff000006137678 (&q->q_usage_counter(io)#6){++++}-{0:0}, at: virtblk_freeze+0x24/0x60
 #5: ffff0000061376b0 (&q->q_usage_counter(queue)#2){++++}-{0:0}, at: virtblk_freeze+0x24/0x60
 #6: ffff000006136d68 (&q->q_usage_counter(io)#5){++++}-{0:0}, at: virtblk_freeze+0x24/0x60
 #7: ffff000006136da0 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x24/0x60
 #8: ffff0000056208f8 (&dev->mutex){....}-{3:3}, at: device_suspend+0xf8/0x52c

stack backtrace:
CPU: 0 UID: 0 PID: 1299 Comm: rtcwake Not tainted 6.12.0-rc4+ #9291
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x94/0xec
 show_stack+0x18/0x24
 dump_stack_lvl+0x90/0xd0
 dump_stack+0x18/0x24
 print_circular_bug+0x298/0x37c
 check_noncircular+0x15c/0x170
 __lock_acquire+0x1374/0x2224
 lock_acquire+0x200/0x340
 __mutex_lock+0x98/0x428
 mutex_lock_nested+0x24/0x30
 rtnl_lock+0x1c/0x28
 virtnet_freeze_down.isra.0+0x20/0x9c
 virtnet_freeze+0x44/0x60
 virtio_device_freeze+0x68/0x94
 virtio_mmio_freeze+0x14/0x20
 platform_pm_suspend+0x2c/0x6c
 dpm_run_callback+0xa4/0x218
 device_suspend+0x12c/0x52c
 dpm_suspend+0x10c/0x2e4
 dpm_suspend_start+0x70/0x78
 suspend_devices_and_enter+0x130/0x798
 pm_suspend+0x1ac/0x2e8
 state_store+0x8c/0x110
 kobj_attr_store+0x18/0x2c
 sysfs_kf_write+0x4c/0x78
 kernfs_fop_write_iter+0x120/0x1b4
 vfs_write+0x2b0/0x35c
 ksys_write+0x68/0xf4
 __arm64_sys_write+0x1c/0x28
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0x40/0xe8
 do_el0_svc_compat+0x20/0x3c
 el0_svc_compat+0x44/0xe0
 el0t_32_sync_handler+0x98/0x148
 el0t_32_sync+0x194/0x198
virtio_blk virtio2: 1/0/0 default/read/poll queues
virtio_blk virtio3: 1/0/0 default/read/poll queues
OOM killer enabled.
Restarting tasks ... done.
random: crng reseeded on system resumption
PM: suspend exit
--->8---

Let me know if you need more information about my test systems.

> ---
>  block/blk-core.c       | 18 ++++++++++++++++--
>  block/blk-mq.c         | 26 ++++++++++++++++++++++----
>  block/blk.h            | 29 ++++++++++++++++++++++++++---
>  block/genhd.c          | 15 +++++++++++----
>  include/linux/blkdev.h |  6 ++++++
>  5 files changed, 81 insertions(+), 13 deletions(-)
>
> ...

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland