On 2/7/25 5:29 PM, Ming Lei wrote:
> On Thu, Feb 06, 2025 at 06:52:36PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/5/25 9:29 PM, Christoph Hellwig wrote:
>>> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>>>
>>>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>>>  		return;
>>>>
>>>>  	memflags = memalloc_noio_save();
>>>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>>>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>>>> +		mutex_lock(&q->sysfs_lock);
>>>
>>> This now means we hold up to number of request queues sysfs_lock
>>> at the same time. I doubt lockdep will be happy about this.
>>> Did you test this patch with a multi-namespace nvme device or
>>> a multi-LU per host SCSI setup?
>>>
>> Yeah I tested with a multi namespace NVMe disk and lockdep didn't
>> complain. Agreed we need to hold up q->sysfs_lock for multiple
>> request queues at the same time and that may not be elegant, but
>> looking at the mess in __blk_mq_update_nr_hw_queues we may not
>> have other choice which could help correct the lock order.
>
> All q->sysfs_lock instance actually shares same lock class, so this way
> should have triggered double lock warning, please see mutex_init().
>
Well, my understanding of lockdep is that even though all q->sysfs_lock
instances share the same lock class key, lockdep differentiates locks
based on their memory address. Since each instance of &q->sysfs_lock has
a different memory address, lockdep treats each of them as a distinct
lock and, IMO, that avoids triggering the double lock warning.

> The ->sysfs_lock involved in this patch looks only for sync elevator
> switch with reallocating hctxs, so I am wondering why not add new
> dedicated lock for this purpose only?
>
> Then we needn't to worry about its dependency with q->q_usage_counter(io)?
>
Yes, that should be possible, but as Christoph suggested,
__blk_mq_update_nr_hw_queues already runs holding tag_list_lock, so why
shouldn't we use the same tag_list_lock for sched/elevator updates as
well? That way we may avoid adding another new lock.

Thanks,
--Nilay
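For reference on the lock-class point above: mutex_init() allocates one
static lock_class_key per invocation site, so every q->sysfs_lock that is
initialized from the same call site ends up in the same lockdep class,
even though each instance lives at a different address. Roughly (an
illustrative copy of the definition in include/linux/mutex.h):

/*
 * Illustrative only: one static lock_class_key per mutex_init() call
 * site, shared by every mutex initialized from that site.
 */
#define mutex_init(mutex)					\
do {								\
	static struct lock_class_key __key;			\
								\
	__mutex_init((mutex), #mutex, &__key);			\
} while (0)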
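To make the last point a bit more concrete, here is a minimal sketch
(not a real patch) of the idea of reusing the tag set's existing
tag_list_lock to serialize the elevator switch against
__blk_mq_update_nr_hw_queues(), which already walks set->tag_list under
that mutex. The wrapper name elv_switch_serialized() is made up purely
for illustration, and elevator_switch() is static in block/elevator.c,
so an actual change would have to take the lock inside that path rather
than wrap it like this:

/*
 * Hypothetical sketch: order the elevator switch against
 * __blk_mq_update_nr_hw_queues() via set->tag_list_lock instead of
 * adding a new dedicated lock.
 */
static int elv_switch_serialized(struct request_queue *q,
				 struct elevator_type *new_e)
{
	struct blk_mq_tag_set *set = q->tag_set;
	int ret;

	mutex_lock(&set->tag_list_lock);   /* excludes nr_hw_queues updates */
	ret = elevator_switch(q, new_e);
	mutex_unlock(&set->tag_list_lock);

	return ret;
}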