On Feb 09, 2025 / 20:20, Ming Lei wrote:
> All block-internal state for adding/removing debugfs entries has been
> removed, and debugfs can sync everything for us at the fs level, so don't
> grab q->debugfs_mutex for adding/removing block-internal debugfs entries.
>
> Now q->debugfs_mutex is only used for blktrace; meanwhile, move the code
> that creates the queue debugfs dir out of q->sysfs_lock. Both locks are
> connected with the queue freeze I/O lock, and this cuts the chain between
> the queue freeze I/O lock and the debugfs lock.
>
> The following lockdep report can be fixed:
>
> https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/
>
> The following contexts add/remove debugfs entries:
>
> - update nr_hw_queues
>
> - add/remove disks
>
> - elevator switch
>
> - blktrace
>
> blktrace only adds entries under the disk top directory, so we can ignore
> it: it can only work once the disk is added, and nothing overlaps with the
> other contexts, so the blktrace context is fine.
>
> Elevator switch is only allowed after the disk is added, so there is no
> race with add/remove disk. blk_mq_update_nr_hw_queues() always restores
> the previous elevator, so there is no race between these two either. The
> elevator switch context is fine.
>
> So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock for
> adding/removing hctx entries, so there might be a race with add/remove
> disk, which is just fine in reality:
>
> - blk_mq_update_nr_hw_queues() is usually for error recovery, and the
>   disk won't be added/removed at the same time
>
> - even if the two contexts race, it is still fine, since the hctx won't
>   be freed until the queue is dead
>
> - we have never seen reports in this area without holding the debugfs
>   lock in blk_mq_update_nr_hw_queues()
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>

Ming, thank you for this quick action. I applied this series on top of the
v6.14-rc1 kernel and ran the block/002 test case.
Unfortunately, it still fails occasionally with the lockdep "WARNING:
possible circular locking dependency detected" below. Now debugfs_mutex is
no longer reported as one of the dependent locks, so I think this fix is
working as expected. Instead, eq->sysfs_lock creates a similar dependency.
My guess is that this patch avoids one dependency, but another dependency
is still left.

[  115.085704] [ T1023] run blktests block/002 at 2025-02-10 09:22:22
[  115.383653] [ T1054] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[  115.641933] [ T1055] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[  115.642961] [ T1055] scsi host9: scsi_debug: version 0191 [20210520] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[  115.646207] [ T1055] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[  115.648225] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
[  115.654243] [ T1055] sd 9:0:0:0: Attached scsi generic sg3 type 0
[  115.656248] [ T100] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[  115.658403] [ T100] sd 9:0:0:0: [sdd] Write Protect is off
[  115.659125] [ T100] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
[  115.661621] [ T100] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  115.669276] [ T100] sd 9:0:0:0: [sdd] permanent stream count = 5
[  115.673375] [ T100] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
[  115.673974] [ T100] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
[  115.710112] [ T100] sd 9:0:0:0: [sdd] Attached SCSI disk
[  116.464802] [ T1079] ======================================================
[  116.465540] [ T1079] WARNING: possible circular locking dependency detected
[  116.466107] [ T1079] 6.14.0-rc1+ #253 Not tainted
[  116.466581] [ T1079] ------------------------------------------------------
[  116.467141] [ T1079] blktrace/1079 is trying to acquire lock:
[  116.467708] [ T1079] ffff88810539d1e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
[  116.468439] [ T1079] but task is already holding lock:
[  116.469052] [ T1079] ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[  116.469901] [ T1079] which lock already depends on the new lock.
[  116.470762] [ T1079] the existing dependency chain (in reverse order) is:
[  116.473187] [ T1079] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[  116.475670] [ T1079]        down_read+0x9b/0x470
[  116.477001] [ T1079]        lookup_one_unlocked+0xe9/0x120
[  116.478333] [ T1079]        lookup_positive_unlocked+0x1d/0x90
[  116.479648] [ T1079]        debugfs_lookup+0x47/0xa0
[  116.480833] [ T1079]        blk_mq_debugfs_unregister_sched_hctx+0x23/0x50
[  116.482215] [ T1079]        blk_mq_exit_sched+0xb6/0x2b0
[  116.483466] [ T1079]        elevator_switch+0x12a/0x4b0
[  116.484676] [ T1079]        elv_iosched_store+0x29f/0x380
[  116.485841] [ T1079]        queue_attr_store+0x313/0x480
[  116.487078] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[  116.488358] [ T1079]        vfs_write+0x5f9/0xe90
[  116.489460] [ T1079]        ksys_write+0xf6/0x1c0
[  116.490582] [ T1079]        do_syscall_64+0x93/0x180
[  116.491694] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.492996] [ T1079] -> #4 (&eq->sysfs_lock){+.+.}-{4:4}:
[  116.495135] [ T1079]        __mutex_lock+0x1aa/0x1360
[  116.496363] [ T1079]        elevator_switch+0x11f/0x4b0
[  116.497499] [ T1079]        elv_iosched_store+0x29f/0x380
[  116.498660] [ T1079]        queue_attr_store+0x313/0x480
[  116.499752] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[  116.500884] [ T1079]        vfs_write+0x5f9/0xe90
[  116.501964] [ T1079]        ksys_write+0xf6/0x1c0
[  116.503056] [ T1079]        do_syscall_64+0x93/0x180
[  116.504194] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.505356] [ T1079] -> #3 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[  116.507272] [ T1079]        blk_mq_submit_bio+0x19b8/0x2070
[  116.508341] [ T1079]        __submit_bio+0x36b/0x6d0
[  116.509351] [ T1079]        submit_bio_noacct_nocheck+0x542/0xca0
[  116.510447] [ T1079]        iomap_readahead+0x4c4/0x7e0
[  116.511485] [ T1079]        read_pages+0x198/0xaf0
[  116.512468] [ T1079]        page_cache_ra_order+0x4d3/0xb50
[  116.513497] [ T1079]        filemap_get_pages+0x2c7/0x1850
[  116.514511] [ T1079]        filemap_read+0x31d/0xc30
[  116.515492] [ T1079]        xfs_file_buffered_read+0x1e9/0x2a0 [xfs]
[  116.517374] [ T1079]        xfs_file_read_iter+0x25f/0x4a0 [xfs]
[  116.519428] [ T1079]        vfs_read+0x6bc/0xa20
[  116.520904] [ T1079]        ksys_read+0xf6/0x1c0
[  116.522378] [ T1079]        do_syscall_64+0x93/0x180
[  116.523840] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.525467] [ T1079] -> #2 (mapping.invalidate_lock#2){++++}-{4:4}:
[  116.528191] [ T1079]        down_read+0x9b/0x470
[  116.529609] [ T1079]        filemap_fault+0x231/0x2ac0
[  116.531033] [ T1079]        __do_fault+0xf4/0x5d0
[  116.532461] [ T1079]        do_fault+0x965/0x11d0
[  116.533859] [ T1079]        __handle_mm_fault+0x1060/0x1f40
[  116.535343] [ T1079]        handle_mm_fault+0x21a/0x6b0
[  116.536915] [ T1079]        do_user_addr_fault+0x322/0xaa0
[  116.538409] [ T1079]        exc_page_fault+0x7a/0x110
[  116.539811] [ T1079]        asm_exc_page_fault+0x22/0x30
[  116.541247] [ T1079] -> #1 (&vma->vm_lock->lock){++++}-{4:4}:
[  116.543820] [ T1079]        down_write+0x8d/0x200
[  116.545202] [ T1079]        vma_link+0x1ff/0x590
[  116.546588] [ T1079]        insert_vm_struct+0x15a/0x340
[  116.548000] [ T1079]        alloc_bprm+0x626/0xbf0
[  116.549396] [ T1079]        kernel_execve+0x85/0x2f0
[  116.550784] [ T1079]        call_usermodehelper_exec_async+0x21b/0x430
[  116.552352] [ T1079]        ret_from_fork+0x30/0x70
[  116.553751] [ T1079]        ret_from_fork_asm+0x1a/0x30
[  116.555164] [ T1079] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[  116.557848] [ T1079]        __lock_acquire+0x2f52/0x5ea0
[  116.559293] [ T1079]        lock_acquire+0x1b1/0x540
[  116.560678] [ T1079]        __might_fault+0xb9/0x120
[  116.562051] [ T1079]        _copy_to_user+0x1e/0x80
[  116.563446] [ T1079]        relay_file_read+0x149/0x8a0
[  116.564851] [ T1079]        full_proxy_read+0x110/0x1d0
[  116.566262] [ T1079]        vfs_read+0x1bb/0xa20
[  116.567591] [ T1079]        ksys_read+0xf6/0x1c0
[  116.568909] [ T1079]        do_syscall_64+0x93/0x180
[  116.570283] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.571781] [ T1079] other info that might help us debug this:
[  116.575369] [ T1079] Chain exists of:
                  &mm->mmap_lock --> &eq->sysfs_lock --> &sb->s_type->i_mutex_key#3
[  116.579323] [ T1079]  Possible unsafe locking scenario:
[  116.580961] [ T1079]        CPU0                    CPU1
[  116.581866] [ T1079]        ----                    ----
[  116.582737] [ T1079]   lock(&sb->s_type->i_mutex_key#3);
[  116.583611] [ T1079]                                lock(&eq->sysfs_lock);
[  116.584596] [ T1079]                                lock(&sb->s_type->i_mutex_key#3);
[  116.585646] [ T1079]   rlock(&mm->mmap_lock);
[  116.586453] [ T1079]  *** DEADLOCK ***
[  116.588484] [ T1079] 2 locks held by blktrace/1079:
[  116.589318] [ T1079]  #0: ffff888101390af8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x233/0x2f0
[  116.590461] [ T1079]  #1: ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[  116.591720] [ T1079] stack backtrace:
[  116.593166] [ T1079] CPU: 0 UID: 0 PID: 1079 Comm: blktrace Not tainted 6.14.0-rc1+ #253
[  116.593169] [ T1079] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[  116.593172] [ T1079] Call Trace:
[  116.593176] [ T1079]  <TASK>
[  116.593178] [ T1079]  dump_stack_lvl+0x6a/0x90
[  116.593186] [ T1079]  print_circular_bug.cold+0x1e0/0x274
[  116.593191] [ T1079]  check_noncircular+0x306/0x3f0
[  116.593195] [ T1079]  ? __pfx_check_noncircular+0x10/0x10
[  116.593200] [ T1079]  ? lockdep_lock+0xca/0x1c0
[  116.593202] [ T1079]  ? __pfx_lockdep_lock+0x10/0x10
[  116.593206] [ T1079]  __lock_acquire+0x2f52/0x5ea0
[  116.593211] [ T1079]  ? lockdep_unlock+0xf1/0x250
[  116.593213] [ T1079]  ? __pfx___lock_acquire+0x10/0x10
[  116.593216] [ T1079]  ? lock_acquire+0x1b1/0x540
[  116.593220] [ T1079]  lock_acquire+0x1b1/0x540
[  116.593222] [ T1079]  ? __might_fault+0x99/0x120
[  116.593226] [ T1079]  ? __pfx_lock_acquire+0x10/0x10
[  116.593228] [ T1079]  ? lock_is_held_type+0xd5/0x130
[  116.593234] [ T1079]  ? __pfx___might_resched+0x10/0x10
[  116.593239] [ T1079]  ? _raw_spin_unlock+0x29/0x50
[  116.593243] [ T1079]  ? __might_fault+0x99/0x120
[  116.593245] [ T1079]  __might_fault+0xb9/0x120
[  116.593247] [ T1079]  ? __might_fault+0x99/0x120
[  116.593249] [ T1079]  ? __check_object_size+0x3f3/0x540
[  116.593253] [ T1079]  _copy_to_user+0x1e/0x80
[  116.593256] [ T1079]  relay_file_read+0x149/0x8a0
[  116.593261] [ T1079]  ? selinux_file_permission+0x36d/0x420
[  116.593270] [ T1079]  full_proxy_read+0x110/0x1d0
[  116.593272] [ T1079]  ? rw_verify_area+0x2f7/0x520
[  116.593275] [ T1079]  vfs_read+0x1bb/0xa20
[  116.593278] [ T1079]  ? __pfx___mutex_lock+0x10/0x10
[  116.593280] [ T1079]  ? __pfx_lock_release+0x10/0x10
[  116.593283] [ T1079]  ? lock_release+0x45b/0x7a0
[  116.593285] [ T1079]  ? __pfx_vfs_read+0x10/0x10
[  116.593289] [ T1079]  ? __fget_files+0x1ae/0x2e0
[  116.593294] [ T1079]  ksys_read+0xf6/0x1c0
[  116.593296] [ T1079]  ? __pfx_ksys_read+0x10/0x10
[  116.593300] [ T1079]  do_syscall_64+0x93/0x180
[  116.593303] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[  116.593306] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593308] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[  116.593310] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593313] [ T1079]  ? __pfx_vm_mmap_pgoff+0x10/0x10
[  116.593317] [ T1079]  ? __fget_files+0x1ae/0x2e0
[  116.593321] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[  116.593324] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593326] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[  116.593328] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593331] [ T1079]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.593334] [ T1079] RIP: 0033:0x7f9586b50e4a
[  116.593338] [ T1079] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 28 58 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 82 58 f8 ff 48 8b
[  116.593341] [ T1079] RSP: 002b:00007f9586a3ed80 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  116.593348] [ T1079] RAX: ffffffffffffffda RBX: 00007f9580000bb0 RCX: 00007f9586b50e4a
[  116.593349] [ T1079] RDX: 0000000000080000 RSI: 00007f9584800000 RDI: 0000000000000004
[  116.593351] [ T1079] RBP: 00007f9586a3eda0 R08: 0000000000000000 R09: 0000000000000000
[  116.593352] [ T1079] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[  116.593353] [ T1079] R13: 000056218e3f97e0 R14: 0000000000000000 R15: 00007f9580002c90
[  116.593359] [ T1079]  </TASK>
[  116.687238] [ T1023] sd 9:0:0:0: [sdd] Synchronizing SCSI cache