On Feb 09, 2025 / 20:20, Ming Lei wrote:
> All block-internal state for adding/removing debugfs entries has been
> removed, and debugfs can sync everything for us at the fs level, so don't
> grab q->debugfs_mutex for adding/removing block-internal debugfs entries.
>
> Now q->debugfs_mutex is only used for blktrace; meanwhile, move the code
> that creates the queue debugfs dir out of q->sysfs_lock. Both locks are
> connected with the queue freeze I/O lock, and this cuts the chain between
> the queue freeze I/O lock and the debugfs lock.
>
> The following lockdep report can be fixed:
>
> https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/
>
> The following contexts add/remove debugfs entries:
>
> - update nr_hw_queues
>
> - add/remove disks
>
> - elevator switch
>
> - blktrace
>
> blktrace only adds entries under the disk top directory, so we can ignore
> it: it can only work once the disk is added, and nothing overlaps with the
> other contexts, so the blktrace context is fine.
>
> Elevator switch is only allowed after the disk is added, so there is no
> race with add/remove disk. blk_mq_update_nr_hw_queues() always restores
> the previous elevator, so there is no race between these two either. The
> elevator switch context is fine.
>
> So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock for
> adding/removing hctx entries, so there might be a race with add/remove
> disk, which is just fine in reality:
>
> - blk_mq_update_nr_hw_queues() is usually for error recovery, and the
>   disk won't be added/removed at the same time
>
> - even if the two contexts race, it is still fine, since the hctx won't
>   be freed until the queue is dead
>
> - we have never seen reports in this area without holding the debugfs
>   lock in blk_mq_update_nr_hw_queues()
>
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>

Ming, thank you for this quick action. I applied this series on top of the
v6.14-rc1 kernel and ran the block/002 test case.
Unfortunately, it still fails occasionally with the lockdep "WARNING:
possible circular locking dependency detected" below. Now debugfs_mutex is
no longer reported as one of the dependent locks, so I think this fix is
working as expected. Instead, eq->sysfs_lock creates a similar dependency.
My guess is that this patch avoids one dependency, but another dependency
is still left.

[  115.085704] [ T1023] run blktests block/002 at 2025-02-10 09:22:22
[  115.383653] [ T1054] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[  115.641933] [ T1055] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
[  115.642961] [ T1055] scsi host9: scsi_debug: version 0191 [20210520] dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
[  115.646207] [ T1055] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
[  115.648225] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
[  115.654243] [ T1055] sd 9:0:0:0: Attached scsi generic sg3 type 0
[  115.656248] [ T100] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
[  115.658403] [ T100] sd 9:0:0:0: [sdd] Write Protect is off
[  115.659125] [ T100] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
[  115.661621] [ T100] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  115.669276] [ T100] sd 9:0:0:0: [sdd] permanent stream count = 5
[  115.673375] [ T100] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
[  115.673974] [ T100] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
[  115.710112] [ T100] sd 9:0:0:0: [sdd] Attached SCSI disk
[  116.464802] [ T1079] ======================================================
[  116.465540] [ T1079] WARNING: possible circular locking dependency detected
[  116.466107] [ T1079] 6.14.0-rc1+ #253 Not tainted
[  116.466581] [ T1079] ------------------------------------------------------
[  116.467141] [ T1079] blktrace/1079 is trying to acquire lock:
[  116.467708] [ T1079] ffff88810539d1e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
[  116.468439] [ T1079] but task is already holding lock:
[  116.469052] [ T1079] ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[  116.469901] [ T1079] which lock already depends on the new lock.
[  116.470762] [ T1079] the existing dependency chain (in reverse order) is:
[  116.473187] [ T1079] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[  116.475670] [ T1079]        down_read+0x9b/0x470
[  116.477001] [ T1079]        lookup_one_unlocked+0xe9/0x120
[  116.478333] [ T1079]        lookup_positive_unlocked+0x1d/0x90
[  116.479648] [ T1079]        debugfs_lookup+0x47/0xa0
[  116.480833] [ T1079]        blk_mq_debugfs_unregister_sched_hctx+0x23/0x50
[  116.482215] [ T1079]        blk_mq_exit_sched+0xb6/0x2b0
[  116.483466] [ T1079]        elevator_switch+0x12a/0x4b0
[  116.484676] [ T1079]        elv_iosched_store+0x29f/0x380
[  116.485841] [ T1079]        queue_attr_store+0x313/0x480
[  116.487078] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[  116.488358] [ T1079]        vfs_write+0x5f9/0xe90
[  116.489460] [ T1079]        ksys_write+0xf6/0x1c0
[  116.490582] [ T1079]        do_syscall_64+0x93/0x180
[  116.491694] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.492996] [ T1079] -> #4 (&eq->sysfs_lock){+.+.}-{4:4}:
[  116.495135] [ T1079]        __mutex_lock+0x1aa/0x1360
[  116.496363] [ T1079]        elevator_switch+0x11f/0x4b0
[  116.497499] [ T1079]        elv_iosched_store+0x29f/0x380
[  116.498660] [ T1079]        queue_attr_store+0x313/0x480
[  116.499752] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
[  116.500884] [ T1079]        vfs_write+0x5f9/0xe90
[  116.501964] [ T1079]        ksys_write+0xf6/0x1c0
[  116.503056] [ T1079]        do_syscall_64+0x93/0x180
[  116.504194] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.505356] [ T1079] -> #3 (&q->q_usage_counter(io)#2){++++}-{0:0}:
[  116.507272] [ T1079]        blk_mq_submit_bio+0x19b8/0x2070
[  116.508341] [ T1079]        __submit_bio+0x36b/0x6d0
[  116.509351] [ T1079]        submit_bio_noacct_nocheck+0x542/0xca0
[  116.510447] [ T1079]        iomap_readahead+0x4c4/0x7e0
[  116.511485] [ T1079]        read_pages+0x198/0xaf0
[  116.512468] [ T1079]        page_cache_ra_order+0x4d3/0xb50
[  116.513497] [ T1079]        filemap_get_pages+0x2c7/0x1850
[  116.514511] [ T1079]        filemap_read+0x31d/0xc30
[  116.515492] [ T1079]        xfs_file_buffered_read+0x1e9/0x2a0 [xfs]
[  116.517374] [ T1079]        xfs_file_read_iter+0x25f/0x4a0 [xfs]
[  116.519428] [ T1079]        vfs_read+0x6bc/0xa20
[  116.520904] [ T1079]        ksys_read+0xf6/0x1c0
[  116.522378] [ T1079]        do_syscall_64+0x93/0x180
[  116.523840] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.525467] [ T1079] -> #2 (mapping.invalidate_lock#2){++++}-{4:4}:
[  116.528191] [ T1079]        down_read+0x9b/0x470
[  116.529609] [ T1079]        filemap_fault+0x231/0x2ac0
[  116.531033] [ T1079]        __do_fault+0xf4/0x5d0
[  116.532461] [ T1079]        do_fault+0x965/0x11d0
[  116.533859] [ T1079]        __handle_mm_fault+0x1060/0x1f40
[  116.535343] [ T1079]        handle_mm_fault+0x21a/0x6b0
[  116.536915] [ T1079]        do_user_addr_fault+0x322/0xaa0
[  116.538409] [ T1079]        exc_page_fault+0x7a/0x110
[  116.539811] [ T1079]        asm_exc_page_fault+0x22/0x30
[  116.541247] [ T1079] -> #1 (&vma->vm_lock->lock){++++}-{4:4}:
[  116.543820] [ T1079]        down_write+0x8d/0x200
[  116.545202] [ T1079]        vma_link+0x1ff/0x590
[  116.546588] [ T1079]        insert_vm_struct+0x15a/0x340
[  116.548000] [ T1079]        alloc_bprm+0x626/0xbf0
[  116.549396] [ T1079]        kernel_execve+0x85/0x2f0
[  116.550784] [ T1079]        call_usermodehelper_exec_async+0x21b/0x430
[  116.552352] [ T1079]        ret_from_fork+0x30/0x70
[  116.553751] [ T1079]        ret_from_fork_asm+0x1a/0x30
[  116.555164] [ T1079] -> #0 (&mm->mmap_lock){++++}-{4:4}:
[  116.557848] [ T1079]        __lock_acquire+0x2f52/0x5ea0
[  116.559293] [ T1079]        lock_acquire+0x1b1/0x540
[  116.560678] [ T1079]        __might_fault+0xb9/0x120
[  116.562051] [ T1079]        _copy_to_user+0x1e/0x80
[  116.563446] [ T1079]        relay_file_read+0x149/0x8a0
[  116.564851] [ T1079]        full_proxy_read+0x110/0x1d0
[  116.566262] [ T1079]        vfs_read+0x1bb/0xa20
[  116.567591] [ T1079]        ksys_read+0xf6/0x1c0
[  116.568909] [ T1079]        do_syscall_64+0x93/0x180
[  116.570283] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.571781] [ T1079] other info that might help us debug this:
[  116.575369] [ T1079] Chain exists of:
                  &mm->mmap_lock --> &eq->sysfs_lock --> &sb->s_type->i_mutex_key#3
[  116.579323] [ T1079]  Possible unsafe locking scenario:
[  116.580961] [ T1079]        CPU0                    CPU1
[  116.581866] [ T1079]        ----                    ----
[  116.582737] [ T1079]   lock(&sb->s_type->i_mutex_key#3);
[  116.583611] [ T1079]                                lock(&eq->sysfs_lock);
[  116.584596] [ T1079]                                lock(&sb->s_type->i_mutex_key#3);
[  116.585646] [ T1079]   rlock(&mm->mmap_lock);
[  116.586453] [ T1079]  *** DEADLOCK ***
[  116.588484] [ T1079] 2 locks held by blktrace/1079:
[  116.589318] [ T1079]  #0: ffff888101390af8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0x233/0x2f0
[  116.590461] [ T1079]  #1: ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
[  116.591720] [ T1079] stack backtrace:
[  116.593166] [ T1079] CPU: 0 UID: 0 PID: 1079 Comm: blktrace Not tainted 6.14.0-rc1+ #253
[  116.593169] [ T1079] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[  116.593172] [ T1079] Call Trace:
[  116.593176] [ T1079]  <TASK>
[  116.593178] [ T1079]  dump_stack_lvl+0x6a/0x90
[  116.593186] [ T1079]  print_circular_bug.cold+0x1e0/0x274
[  116.593191] [ T1079]  check_noncircular+0x306/0x3f0
[  116.593195] [ T1079]  ? __pfx_check_noncircular+0x10/0x10
[  116.593200] [ T1079]  ? lockdep_lock+0xca/0x1c0
[  116.593202] [ T1079]  ? __pfx_lockdep_lock+0x10/0x10
[  116.593206] [ T1079]  __lock_acquire+0x2f52/0x5ea0
[  116.593211] [ T1079]  ? lockdep_unlock+0xf1/0x250
[  116.593213] [ T1079]  ? __pfx___lock_acquire+0x10/0x10
[  116.593216] [ T1079]  ? lock_acquire+0x1b1/0x540
[  116.593220] [ T1079]  lock_acquire+0x1b1/0x540
[  116.593222] [ T1079]  ? __might_fault+0x99/0x120
[  116.593226] [ T1079]  ? __pfx_lock_acquire+0x10/0x10
[  116.593228] [ T1079]  ? lock_is_held_type+0xd5/0x130
[  116.593234] [ T1079]  ? __pfx___might_resched+0x10/0x10
[  116.593239] [ T1079]  ? _raw_spin_unlock+0x29/0x50
[  116.593243] [ T1079]  ? __might_fault+0x99/0x120
[  116.593245] [ T1079]  __might_fault+0xb9/0x120
[  116.593247] [ T1079]  ? __might_fault+0x99/0x120
[  116.593249] [ T1079]  ? __check_object_size+0x3f3/0x540
[  116.593253] [ T1079]  _copy_to_user+0x1e/0x80
[  116.593256] [ T1079]  relay_file_read+0x149/0x8a0
[  116.593261] [ T1079]  ? selinux_file_permission+0x36d/0x420
[  116.593270] [ T1079]  full_proxy_read+0x110/0x1d0
[  116.593272] [ T1079]  ? rw_verify_area+0x2f7/0x520
[  116.593275] [ T1079]  vfs_read+0x1bb/0xa20
[  116.593278] [ T1079]  ? __pfx___mutex_lock+0x10/0x10
[  116.593280] [ T1079]  ? __pfx_lock_release+0x10/0x10
[  116.593283] [ T1079]  ? lock_release+0x45b/0x7a0
[  116.593285] [ T1079]  ? __pfx_vfs_read+0x10/0x10
[  116.593289] [ T1079]  ? __fget_files+0x1ae/0x2e0
[  116.593294] [ T1079]  ksys_read+0xf6/0x1c0
[  116.593296] [ T1079]  ? __pfx_ksys_read+0x10/0x10
[  116.593300] [ T1079]  do_syscall_64+0x93/0x180
[  116.593303] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[  116.593306] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593308] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[  116.593310] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593313] [ T1079]  ? __pfx_vm_mmap_pgoff+0x10/0x10
[  116.593317] [ T1079]  ? __fget_files+0x1ae/0x2e0
[  116.593321] [ T1079]  ? lockdep_hardirqs_on_prepare+0x16d/0x400
[  116.593324] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593326] [ T1079]  ? lockdep_hardirqs_on+0x78/0x100
[  116.593328] [ T1079]  ? do_syscall_64+0x9f/0x180
[  116.593331] [ T1079]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  116.593334] [ T1079] RIP: 0033:0x7f9586b50e4a
[  116.593338] [ T1079] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 28 58 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 82 58 f8 ff 48 8b
[  116.593341] [ T1079] RSP: 002b:00007f9586a3ed80 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  116.593348] [ T1079] RAX: ffffffffffffffda RBX: 00007f9580000bb0 RCX: 00007f9586b50e4a
[  116.593349] [ T1079] RDX: 0000000000080000 RSI: 00007f9584800000 RDI: 0000000000000004
[  116.593351] [ T1079] RBP: 00007f9586a3eda0 R08: 0000000000000000 R09: 0000000000000000
[  116.593352] [ T1079] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[  116.593353] [ T1079] R13: 000056218e3f97e0 R14: 0000000000000000 R15: 00007f9580002c90
[  116.593359] [ T1079]  </TASK>
[  116.687238] [ T1023] sd 9:0:0:0: [sdd] Synchronizing SCSI cache