On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>
>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  		return;
>>
>>  	memflags = memalloc_noio_save();
>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		mutex_lock(&q->sysfs_lock);
>
> This now means we hold up to number of request queues sysfs_lock
> at the same time. I doubt lockdep will be happy about this.
> Did you test this patch with a multi-namespace nvme device or
> a multi-LU per host SCSI setup?
>
Yes, I tested this with a multi-namespace NVMe disk and lockdep didn't
complain. Agreed, we end up holding q->sysfs_lock for multiple request
queues at the same time, which is not elegant, but looking at the mess
in __blk_mq_update_nr_hw_queues we may not have another choice that
corrects the lock order.

> I suspect the answer here is to (ab)use the tag_list_lock for
> scheduler updates - while the scope is too broad for just
> changing it on a single queue it is a rare operation and it
> solves the mess in __blk_mq_update_nr_hw_queues.
>
Yes, that is probably a good idea: instead of using q->sysfs_lock we
could rely on q->tag_set->tag_list_lock for sched/elevator updates,
given that __blk_mq_update_nr_hw_queues already runs with
tag_list_lock held. That in turn requires taking the same
tag_list_lock, instead of the current sysfs_lock, when updating the
scheduler from sysfs, but that's a trivial change; a rough sketch of
that sysfs-side change is appended below.

Thanks,
--Nilay
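
For illustration only, an untested sketch of what the sysfs-side store
handler could look like under that scheme; the handler name
queue_elv_store() and the elevator_change() helper are stand-ins for
whatever the real sysfs store path uses, not the actual patch:

/*
 * Untested sketch: serialize the elevator switch on the set-wide
 * tag_list_lock instead of q->sysfs_lock, so the sysfs path uses the
 * same lock as __blk_mq_update_nr_hw_queues(), which already walks
 * set->tag_list with tag_list_lock held.
 */
static ssize_t queue_elv_store(struct request_queue *q, const char *buf,
			       size_t count)
{
	int ret;

	if (!queue_is_mq(q))
		return count;

	/*
	 * Take the set-wide lock rather than nesting one q->sysfs_lock
	 * per queue inside __blk_mq_update_nr_hw_queues().
	 */
	mutex_lock(&q->tag_set->tag_list_lock);
	ret = elevator_change(q, buf);	/* whatever helper does the switch */
	mutex_unlock(&q->tag_set->tag_list_lock);

	return ret ? ret : count;
}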