On 2/10/25 6:22 AM, Shinichiro Kawasaki wrote:
> On Feb 09, 2025 / 20:20, Ming Lei wrote:
>> All block internal state for dealing with adding/removing debugfs
>> entries has been removed, and debugfs can sync everything for us at
>> the fs level, so don't grab q->debugfs_mutex for adding/removing
>> block internal debugfs entries.
>>
>> Now q->debugfs_mutex is only used for blktrace; meantime, move the
>> code that creates the queue debugfs dir out of q->sysfs_lock. Both
>> locks are connected with the queue freeze I/O lock, so the chain from
>> the queue freeze I/O lock to the debugfs lock is cut.
>>
>> This fixes the following lockdep report:
>>
>> https://lore.kernel.org/linux-block/ougniadskhks7uyxguxihgeuh2pv4yaqv4q3emo4gwuolgzdt6@brotly74p6bs/
>>
>> The following contexts add/remove debugfs entries:
>>
>> - update nr_hw_queues
>>
>> - add/remove disks
>>
>> - elevator switch
>>
>> - blktrace
>>
>> blktrace only adds entries under the disk top directory, so we can
>> ignore it, because it can only work if the disk is added. Also,
>> nothing overlaps with the other two contexts, so the blktrace context
>> is fine.
>>
>> Elevator switch is only allowed after the disk is added, so there is
>> no race with add/remove disk. blk_mq_update_nr_hw_queues() always
>> restores the previous elevator, so there is no race between these two
>> either. The elevator switch context is fine.
>>
>> So far blk_mq_update_nr_hw_queues() doesn't hold the debugfs lock for
>> adding/removing hctx entries, so there might be a race with add/remove
>> disk, which is just fine in reality:
>>
>> - blk_mq_update_nr_hw_queues() is usually for error recovery, and the
>>   disk won't be added/removed at the same time
>>
>> - even if there is a race between the two contexts, it is just fine,
>>   since the hctx won't be freed until the queue is dead
>>
>> - we have never seen reports in this area without holding the debugfs
>>   lock in blk_mq_update_nr_hw_queues()
>>
>> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> [...]
>
> [ 115.085704] [ T1023] run blktests block/002 at 2025-02-10 09:22:22
> [ 115.383653] [ T1054] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
> [ 115.641933] [ T1055] scsi_debug:sdebug_driver_probe: scsi_debug: trim poll_queues to 0. poll_q/nr_hw = (0/1)
> [ 115.642961] [ T1055] scsi host9: scsi_debug: version 0191 [20210520]
>                 dev_size_mb=8, opts=0x0, submit_queues=1, statistics=0
> [ 115.646207] [ T1055] scsi 9:0:0:0: Direct-Access Linux scsi_debug 0191 PQ: 0 ANSI: 7
> [ 115.648225] [ C0] scsi 9:0:0:0: Power-on or device reset occurred
> [ 115.654243] [ T1055] sd 9:0:0:0: Attached scsi generic sg3 type 0
> [ 115.656248] [ T100] sd 9:0:0:0: [sdd] 16384 512-byte logical blocks: (8.39 MB/8.00 MiB)
> [ 115.658403] [ T100] sd 9:0:0:0: [sdd] Write Protect is off
> [ 115.659125] [ T100] sd 9:0:0:0: [sdd] Mode Sense: 73 00 10 08
> [ 115.661621] [ T100] sd 9:0:0:0: [sdd] Write cache: enabled, read cache: enabled, supports DPO and FUA
> [ 115.669276] [ T100] sd 9:0:0:0: [sdd] permanent stream count = 5
> [ 115.673375] [ T100] sd 9:0:0:0: [sdd] Preferred minimum I/O size 512 bytes
> [ 115.673974] [ T100] sd 9:0:0:0: [sdd] Optimal transfer size 524288 bytes
> [ 115.710112] [ T100] sd 9:0:0:0: [sdd] Attached SCSI disk
>
> [ 116.464802] [ T1079] ======================================================
> [ 116.465540] [ T1079] WARNING: possible circular locking dependency detected
> [ 116.466107] [ T1079] 6.14.0-rc1+ #253 Not tainted
> [ 116.466581] [ T1079] ------------------------------------------------------
> [ 116.467141] [ T1079] blktrace/1079 is trying to acquire lock:
> [ 116.467708] [ T1079] ffff88810539d1e0 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x99/0x120
> [ 116.468439] [ T1079]
>                but task is already holding lock:
> [ 116.469052] [ T1079] ffff88810a5fd758 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: relay_file_read+0xa3/0x8a0
> [ 116.469901] [ T1079]
>                which lock already depends on the new lock.
>
> [ 116.470762] [ T1079]
>                the existing dependency chain (in reverse order) is:
> [ 116.473187] [ T1079]
>                -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
> [ 116.475670] [ T1079]        down_read+0x9b/0x470
> [ 116.477001] [ T1079]        lookup_one_unlocked+0xe9/0x120
> [ 116.478333] [ T1079]        lookup_positive_unlocked+0x1d/0x90
> [ 116.479648] [ T1079]        debugfs_lookup+0x47/0xa0
> [ 116.480833] [ T1079]        blk_mq_debugfs_unregister_sched_hctx+0x23/0x50
> [ 116.482215] [ T1079]        blk_mq_exit_sched+0xb6/0x2b0
> [ 116.483466] [ T1079]        elevator_switch+0x12a/0x4b0
> [ 116.484676] [ T1079]        elv_iosched_store+0x29f/0x380
> [ 116.485841] [ T1079]        queue_attr_store+0x313/0x480
> [ 116.487078] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
> [ 116.488358] [ T1079]        vfs_write+0x5f9/0xe90
> [ 116.489460] [ T1079]        ksys_write+0xf6/0x1c0
> [ 116.490582] [ T1079]        do_syscall_64+0x93/0x180
> [ 116.491694] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 116.492996] [ T1079]
>                -> #4 (&eq->sysfs_lock){+.+.}-{4:4}:
> [ 116.495135] [ T1079]        __mutex_lock+0x1aa/0x1360
> [ 116.496363] [ T1079]        elevator_switch+0x11f/0x4b0
> [ 116.497499] [ T1079]        elv_iosched_store+0x29f/0x380
> [ 116.498660] [ T1079]        queue_attr_store+0x313/0x480
> [ 116.499752] [ T1079]        kernfs_fop_write_iter+0x39e/0x5a0
> [ 116.500884] [ T1079]        vfs_write+0x5f9/0xe90
> [ 116.501964] [ T1079]        ksys_write+0xf6/0x1c0
> [ 116.503056] [ T1079]        do_syscall_64+0x93/0x180
> [ 116.504194] [ T1079]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 116.505356] [ T1079]

In the above dependency chain, I see that thread #5 is waiting on
&sb->s_type->i_mutex_key after acquiring q->sysfs_lock and eq->sysfs_lock.
And thread #4 would have acquired q->sysfs_lock and is now pending on
eq->sysfs_lock. But then how is it possible that two threads are able to
acquire q->sysfs_lock at the same time (assuming this is for the same
request_queue)? Is this a false-positive report from lockdep? Or am I
missing something?

Thanks,
--Nilay