On Feb 10, 2025 / 12:31, Nilay Shroff wrote:
>
>
> On 2/10/25 7:12 AM, Ming Lei wrote:
> > On Mon, Feb 10, 2025 at 8:52 AM Shinichiro Kawasaki
> > <shinichiro.kawasaki@xxxxxxx> wrote:
> >>
> >> On Feb 09, 2025 / 20:20, Ming Lei wrote:
> >>> All block-internal state for adding/removing debugfs entries has
> >>> been removed, and debugfs can synchronize everything for us at the
> >>> fs level, so don't grab q->debugfs_mutex for adding/removing
> >>> block-internal debugfs entries.
> >>>
> >>> Now q->debugfs_mutex is only used for blktrace; meanwhile, move the
> >>> code that creates the queue debugfs dir out of q->sysfs_lock. Both
> >>> locks are connected with the queue freeze I/O lock, so the chain
> >>> between the queue freeze I/O lock and the debugfs lock is cut.
> >>>
> >>> The following lockdep report can be fixed:
> >>>
> >>> https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/
> >>>
> >>> The following contexts add/remove debugfs entries:
> >>>
> >>> - update nr_hw_queues
> >>>
> >>> - add/remove disks
> >>>
> >>> - elevator switch
> >>>
> >>> blktrace only adds entries under the disk top directory, so we can
> >>> ignore it: it can only work once the disk is added, and it overlaps
> >>> with neither of the other two contexts, so the blktrace context is
> >>> fine.
> >>>
> >>> Elevator switch is only allowed after the disk is added, so there is
> >>> no race with add/remove disk. blk_mq_update_nr_hw_queues() always
> >>> restores the previous elevator, so there is no race between these
> >>> two either. The elevator switch context is fine.
> >>>
> >>> So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock
> >>> for adding/removing hctx entries, so there might be a race with
> >>> add/remove disk, which is just fine in reality:
> >>>
> >>> - blk_mq_update_nr_hw_queues() is usually for error recovery, and
> >>> the disk won't be added/removed at the same time
> >>>
> >>> - even if there is a race between the two contexts, it is just fine,
> >>> since the hctx won't be freed until the queue is dead
> >>>
> >>> - we have never seen reports from this area despite not holding the
> >>> debugfs lock in blk_mq_update_nr_hw_queues()
> >>>
> >>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> >>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> >>
> >> Ming, thank you for this quick action. I applied this series on top
> >> of the v6.14-rc1 kernel and ran the block/002 test case.
> >> Unfortunately, it still fails occasionally with the lockdep "WARNING:
> >> possible circular locking dependency detected" below. debugfs_mutex
> >> is no longer reported as one of the dependent locks, so I think this
> >> fix is working as expected. Instead, eq->sysfs_lock creates a similar
> >> dependency. My guess is that this patch avoids one dependency, but
> >> another dependency remains.
> >
> > Indeed, this patch cuts the dependency on both q->sysfs_lock and
> > q->debugfs_lock,
>
> Glad to see that with this patch we're able to cut the dependency between
> q->sysfs_lock and q->debugfs_lock.
>
> > but elevator ->sysfs_lock isn't covered, :-(
> >
>
> I believe that will be fixed by the effort currently underway here:
> https://lore.kernel.org/all/20250205144506.663819-1-nilay@xxxxxxxxxxxxx/

Thanks Nilay for the information. I applied your patch series on top of
v6.14-rc1 and Ming's. Then I ran block/002. Alas, the test case still
fails with the "WARNING: possible circular locking dependency detected".
The lockdep splat contents have changed, as below:

[ 222.549099] [ T1041] run blktests block/002 at 2025-02-10 17:13:18
[ 222.887999] [ T1072] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[ 223.172864] [ T1073] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[ 223.173872] [ T1073] scsi host9: scsi_debug: version 0191 [20210520] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[ 223.177251] [ T1073] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[ 223.179393] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
[ 223.185351] [ T1073] sd 9:0:0:0: Attached scsi generic sg3 type 0
[ 223.187514] [ T101] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[ 223.189675] [ T101] sd 9:0:0:0: [sdd] Write Protect is off
[ 223.190146] [ T101] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
[ 223.192369] [ T101] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 223.199999] [ T101] sd 9:0:0:0: [sdd] permanent stream count = 5
[ 223.203950] [ T101] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
[ 223.204568] [ T101] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
[ 223.245371] [ T101] sd 9:0:0:0: [sdd] Attached SCSI disk
[ 223.999340] [ T1097] ======================================================
[ 224.000044] [ T1097] WARNING: possible circular locking dependency detected
[ 224.000651] [ T1097] 6.14.0-rc1+ #255 Not tainted
[ 224.001109] [ T1097] ------------------------------------------------------
[ 224.001791] [ T1097] blktrace/1097 is trying to acquire lock:
[ 224.002220] [ T1097] ffff88814c53bde0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
[ 224.002887] [ T1097] but task is already holding lock:
[ 224.003423] [ T1097] ffff88810079dc68 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[ 224.004173] [ T1097] which lock already depends on the new lock.
[ 224.004934] [ T1097] the existing dependency chain (in reverse order) is:
[ 224.007356] [ T1097] -> #6 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[ 224.009859] [ T1097]        down_write+0x8d/0x200
[ 224.011109] [ T1097]        start_creating.part.0+0x82/0x230
[ 224.012465] [ T1097]        debugfs_create_dir+0x3a/0x4c0
[ 224.013793] [ T1097]        blk_mq_debugfs_register_rqos+0x1a5/0x3a0
[ 224.015101] [ T1097]        rq_qos_add+0x2f6/0x450
[ 224.016312] [ T1097]        wbt_init+0x359/0x490
[ 224.017507] [ T1097]        blk_register_queue+0x1c0/0x330
[ 224.018800] [ T1097]        add_disk_fwnode+0x6b1/0x1010
[ 224.020047] [ T1097]        sd_probe+0x94e/0xf30
[ 224.021235] [ T1097]        really_probe+0x1e3/0x8a0
[ 224.022402] [ T1097]        __driver_probe_device+0x18c/0x370
[ 224.023652] [ T1097]        driver_probe_device+0x4a/0x120
[ 224.024816] [ T1097]        __device_attach_driver+0x15e/0x270
[ 224.026038] [ T1097]        bus_for_each_drv+0x114/0x1a0
[ 224.027212] [ T1097]        __device_attach_async_helper+0x19c/0x240
[ 224.028476] [ T1097]        async_run_entry_fn+0x96/0x4f0
[ 224.029664] [ T1097]        process_one_work+0x85a/0x1460
[ 224.030798] [ T1097]        worker_thread+0x5e2/0xfc0
[ 224.032037] [ T1097]        kthread+0x39d/0x750
[ 224.033136] [ T1097]        ret_from_fork+0x30/0x70
[ 224.034232] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.035360] [ T1097] -> #5 (&q->rq_qos_mutex){+.+.}-{4:4}:
[ 224.037338] [ T1097]        __mutex_lock+0x1aa/0x1360
[ 224.038397] [ T1097]        wbt_init+0x343/0x490
[ 224.039497] [ T1097]        blk_register_queue+0x1c0/0x330
[ 224.040612] [ T1097]        add_disk_fwnode+0x6b1/0x1010
[ 224.041675] [ T1097]        sd_probe+0x94e/0xf30
[ 224.042662] [ T1097]        really_probe+0x1e3/0x8a0
[ 224.043672] [ T1097]        __driver_probe_device+0x18c/0x370
[ 224.044728] [ T1097]        driver_probe_device+0x4a/0x120
[ 224.045761] [ T1097]        __device_attach_driver+0x15e/0x270
[ 224.046818] [ T1097]        bus_for_each_drv+0x114/0x1a0
[ 224.047842] [ T1097]        __device_attach_async_helper+0x19c/0x240
[ 224.048876] [ T1097]        async_run_entry_fn+0x96/0x4f0
[ 224.049840] [ T1097]        process_one_work+0x85a/0x1460
[ 224.050794] [ T1097]        worker_thread+0x5e2/0xfc0
[ 224.051718] [ T1097]        kthread+0x39d/0x750
[ 224.052653] [ T1097]        ret_from_fork+0x30/0x70
[ 224.053548] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.054492] [ T1097] -> #4 (&q->sysfs_lock){+.+.}-{4:4}:
[ 224.056073] [ T1097]        __mutex_lock+0x1aa/0x1360
[ 224.056934] [ T1097]        elv_iosched_store+0xf2/0x500
[ 224.057809] [ T1097]        queue_attr_store+0x35a/0x480
[ 224.058680] [ T1097]        kernfs_fop_write_iter+0x39e/0x5a0
[ 224.059593] [ T1097]        vfs_write+0x5f9/0xe90
[ 224.060449] [ T1097]        ksys_write+0xf6/0x1c0
[ 224.061294] [ T1097]        do_syscall_64+0x93/0x180
[ 224.062153] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.063087] [ T1097] -> #3 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[ 224.064675] [ T1097]        blk_mq_submit_bio+0x19b8/0x2070
[ 224.065562] [ T1097]        __submit_bio+0x36b/0x6d0
[ 224.066428] [ T1097]        submit_bio_noacct_nocheck+0x542/0xca0
[ 224.067371] [ T1097]        iomap_readahead+0x4c4/0x7e0
[ 224.068254] [ T1097]        read_pages+0x198/0xaf0
[ 224.069085] [ T1097]        page_cache_ra_order+0x4d3/0xb50
[ 224.069976] [ T1097]        filemap_get_pages+0x2c7/0x1850
[ 224.070862] [ T1097]        filemap_read+0x31d/0xc30
[ 224.071707] [ T1097]        xfs_file_buffered_read+0x1e9/0x2a0 [xfs]
[ 224.072927] [ T1097]        xfs_file_read_iter+0x25f/0x4a0 [xfs]
[ 224.074093] [ T1097]        vfs_read+0x6bc/0xa20
[ 224.074918] [ T1097]        ksys_read+0xf6/0x1c0
[ 224.075735] [ T1097]        do_syscall_64+0x93/0x180
[ 224.076576] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.077534] [ T1097] -> #2 (mapping.invalidate_lock#2){++++}-{4:4}:
[ 224.079118] [ T1097]        down_read+0x9b/0x470
[ 224.079950] [ T1097]        filemap_fault+0x231/0x2ac0
[ 224.080818] [ T1097]        __do_fault+0xf4/0x5d0
[ 224.081658] [ T1097]        do_fault+0x965/0x11d0
[ 224.082495] [ T1097]        __handle_mm_fault+0x1060/0x1f40
[ 224.083408] [ T1097]        handle_mm_fault+0x21a/0x6b0
[ 224.084297] [ T1097]        do_user_addr_fault+0x322/0xaa0
[ 224.085205] [ T1097]        exc_page_fault+0x7a/0x110
[ 224.086061] [ T1097]        asm_exc_page_fault+0x22/0x30
[ 224.086941] [ T1097] -> #1 (&vma->vm_lock->lock){++++}-{4:4}:
[ 224.088495] [ T1097]        down_write+0x8d/0x200
[ 224.089353] [ T1097]        vma_link+0x1ff/0x590
[ 224.090200] [ T1097]        insert_vm_struct+0x15a/0x340
[ 224.091072] [ T1097]        alloc_bprm+0x626/0xbf0
[ 224.091909] [ T1097]        kernel_execve+0x85/0x2f0
[ 224.092754] [ T1097]        call_usermodehelper_exec_async+0x21b/0x430
[ 224.093738] [ T1097]        ret_from_fork+0x30/0x70
[ 224.094587] [ T1097]        ret_from_fork_asm+0x1a/0x30
[ 224.095541] [ T1097] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[ 224.097077] [ T1097]        __lock_acquire+0x2f52/0x5ea0
[ 224.097963] [ T1097]        lock_acquire+0x1b1/0x540
[ 224.098818] [ T1097]        __might_fault+0xb9/0x120
[ 224.099672] [ T1097]        _copy_to_user+0x1e/0x80
[ 224.100519] [ T1097]        relay_file_read+0x149/0x8a0
[ 224.101408] [ T1097]        full_proxy_read+0x110/0x1d0
[ 224.102297] [ T1097]        vfs_read+0x1bb/0xa20
[ 224.103115] [ T1097]        ksys_read+0xf6/0x1c0
[ 224.103937] [ T1097]        do_syscall_64+0x93/0x180
[ 224.104782] [ T1097]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.105726] [ T1097] other info that might help us debug this:
[ 224.107884] [ T1097] Chain exists of: &mm->mmap_lock --> &q->rq_qos_mutex --> &sb->s_type->i_mutex_key#3
[ 224.110335] [ T1097]  Possible unsafe locking scenario:
[ 224.111856] [ T1097]        CPU0                    CPU1
[ 224.112723] [ T1097]        ----                    ----
[ 224.113588] [ T1097]   lock(&sb->s_type->i_mutex_key#3);
[ 224.114492] [ T1097]                                lock(&q->rq_qos_mutex);
[ 224.115499] [ T1097]                                lock(&sb->s_type->i_mutex_key#3);
[ 224.116609] [ T1097]   rlock(&mm->mmap_lock);
[ 224.117440] [ T1097]  *** DEADLOCK ***
[ 224.119479] [ T1097] 2 locks held by blktrace/1097:
[ 224.120325] [ T1097] #0: ffff88812e5b80f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x233/0x2f0
[ 224.121474] [ T1097] #1: ffff88810079dc68 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[ 224.122741] [ T1097] stack backtrace:
[ 224.124186] [ T1097] CPU: 0 UID: 0 PID: 1097 Comm: blktrace Not tainted 6.14.0-rc1+ #255
[ 224.124189] [ T1097] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 224.124193] [ T1097] Call Trace:
[ 224.124195] [ T1097]  <TASK>
[ 224.124197] [ T1097]  dump_stack_lvl+0x6a/0x90
[ 224.124201] [ T1097]  print_circular_bug.cold+0x1e0/0x274
[ 224.124205] [ T1097]  check_noncircular+0x306/0x3f0
[ 224.124209] [ T1097]  ? __pfx_check_noncircular+0x10/0x10
[ 224.124213] [ T1097]  ? lockdep_lock+0xca/0x1c0
[ 224.124215] [ T1097]  ? __pfx_lockdep_lock+0x10/0x10
[ 224.124219] [ T1097]  __lock_acquire+0x2f52/0x5ea0
[ 224.124224] [ T1097]  ? find_held_lock+0x2d/0x110
[ 224.124226] [ T1097]  ? __pfx___lock_acquire+0x10/0x10
[ 224.124230] [ T1097]  ? lock_acquire+0x1b1/0x540
[ 224.124233] [ T1097]  lock_acquire+0x1b1/0x540
[ 224.124236] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124239] [ T1097]  ? __pfx_lock_acquire+0x10/0x10
[ 224.124242] [ T1097]  ? lock_is_held_type+0xd5/0x130
[ 224.124245] [ T1097]  ? __pfx___might_resched+0x10/0x10
[ 224.124248] [ T1097]  ? _raw_spin_unlock+0x29/0x50
[ 224.124250] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124252] [ T1097]  __might_fault+0xb9/0x120
[ 224.124254] [ T1097]  ? __might_fault+0x99/0x120
[ 224.124256] [ T1097]  ? __check_object_size+0x3f3/0x540
[ 224.124258] [ T1097]  _copy_to_user+0x1e/0x80
[ 224.124261] [ T1097]  relay_file_read+0x149/0x8a0
[ 224.124266] [ T1097]  ? selinux_file_permission+0x36d/0x420
[ 224.124270] [ T1097]  full_proxy_read+0x110/0x1d0
[ 224.124272] [ T1097]  ? rw_verify_area+0x2f7/0x520
[ 224.124274] [ T1097]  vfs_read+0x1bb/0xa20
[ 224.124277] [ T1097]  ? __pfx___mutex_lock+0x10/0x10
[ 224.124279] [ T1097]  ? __pfx_lock_release+0x10/0x10
[ 224.124282] [ T1097]  ? __pfx_vfs_read+0x10/0x10
[ 224.124286] [ T1097]  ? __fget_files+0x1ae/0x2e0
[ 224.124290] [ T1097]  ksys_read+0xf6/0x1c0
[ 224.124293] [ T1097]  ? __pfx_ksys_read+0x10/0x10
[ 224.124297] [ T1097]  do_syscall_64+0x93/0x180
[ 224.124299] [ T1097]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ 224.124301] [ T1097]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[ 224.124305] [ T1097]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ 224.124306] [ T1097]  ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[ 224.124311] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124313] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124315] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124318] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124319] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124322] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124324] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124326] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124328] [ T1097]  ? lockdep_hardirqs_on+0x78/0x100
[ 224.124330] [ T1097]  ? do_syscall_64+0x9f/0x180
[ 224.124333] [ T1097]  ? handle_mm_fault+0x39d/0x6b0
[ 224.124337] [ T1097]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[ 224.124341] [ T1097]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 224.124344] [ T1097] RIP: 0033:0x7fdbd297be4a
[ 224.124348] [ T1097] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 28 58 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 82 58 f8 ff 48 8b
[ 224.124350] [ T1097] RSP: 002b:00007fdbd2869d80 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 224.124354] [ T1097] RAX: ffffffffffffffda RBX: 00007fdbcc000bb0 RCX: 00007fdbd297be4a
[ 224.124356] [ T1097] RDX: 0000000000080000 RSI: 00007fdbc3000000 RDI: 0000000000000004
[ 224.124357] [ T1097] RBP: 00007fdbd2869da0 R08: 0000000000000000 R09: 0000000000000000
[ 224.124358] [ T1097] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 224.124360] [ T1097] R13: 000055e34ca547e0 R14: 0000000000000000 R15: 00007fdbcc002c90
[ 224.124364] [ T1097]  </TASK>
[ 224.220675] [ T1041] sd 9:0:0:0: [sdd] Synchronizing SCSI cache