On 2/5/25 9:29 PM, Christoph Hellwig wrote:
> On Wed, Feb 05, 2025 at 08:14:47PM +0530, Nilay Shroff wrote:
>>
>>  static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>> @@ -5006,8 +5008,10 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
>>  		return;
>>
>>  	memflags = memalloc_noio_save();
>> -	list_for_each_entry(q, &set->tag_list, tag_set_list)
>> +	list_for_each_entry(q, &set->tag_list, tag_set_list) {
>> +		mutex_lock(&q->sysfs_lock);
>
> This now means we hold up to number of request queues sysfs_lock
> at the same time. I doubt lockdep will be happy about this.
> Did you test this patch with a multi-namespace nvme device or
> a multi-LU per host SCSI setup?
>
Yes, I tested this with a multi-namespace NVMe disk and lockdep didn't
complain. Agreed, we end up holding q->sysfs_lock for multiple request
queues at the same time, which is not elegant, but looking at the mess
in __blk_mq_update_nr_hw_queues we may not have another choice that
corrects the lock order.

> I suspect the answer here is to (ab)use the tag_list_lock for
> scheduler updates - while the scope is too broad for just
> changing it on a single queue it is a rare operation and it
> solves the mess in __blk_mq_update_nr_hw_queues.
>
Yes, that is probably a good idea: instead of using q->sysfs_lock we
could rely on q->tag_set->tag_list_lock for sched/elevator updates,
given that __blk_mq_update_nr_hw_queues already runs with
tag_list_lock held. That in turn requires taking the same
tag_list_lock, instead of the current sysfs_lock, when updating the
scheduler from sysfs, but that's a trivial change; a rough sketch of
that sysfs-side change is appended below.

Thanks,
--Nilay
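
For illustration only, an untested sketch of what the sysfs-side store
handler could look like under that scheme; the handler name
queue_elv_store() and the elevator_change() helper are stand-ins for
whatever the real sysfs store path uses, not the actual patch:

/*
 * Untested sketch: serialize the elevator switch on the set-wide
 * tag_list_lock instead of q->sysfs_lock, so the sysfs path uses the
 * same lock as __blk_mq_update_nr_hw_queues(), which already walks
 * set->tag_list with tag_list_lock held.
 */
static ssize_t queue_elv_store(struct request_queue *q, const char *buf,
			       size_t count)
{
	int ret;

	if (!queue_is_mq(q))
		return count;

	/*
	 * Take the set-wide lock rather than nesting one q->sysfs_lock
	 * per queue inside __blk_mq_update_nr_hw_queues().
	 */
	mutex_lock(&q->tag_set->tag_list_lock);
	ret = elevator_change(q, buf);	/* whatever helper does the switch */
	mutex_unlock(&q->tag_set->tag_list_lock);

	return ret ? ret : count;
}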