Re: [PATCH 2/2] block: avoid acquiring q->sysfs_lock while accessing sysfs attributes

On 2/9/25 5:11 PM, Ming Lei wrote:
> On Sat, Feb 08, 2025 at 06:26:38PM +0530, Nilay Shroff wrote:
>>
>>
>> On 2/8/25 4:11 PM, Ming Lei wrote:
>>> On Thu, Feb 06, 2025 at 07:24:02PM +0530, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 2/5/25 9:23 PM, Christoph Hellwig wrote:
>>>>> On Wed, Feb 05, 2025 at 08:14:48PM +0530, Nilay Shroff wrote:
>>>>>> The sysfs attributes are already protected by sysfs/kernfs internal
>>>>>> locking, so acquiring q->sysfs_lock is not needed while accessing sysfs
>>>>>> attribute files. This change therefore avoids holding q->sysfs_lock
>>>>>> while accessing sysfs attribute files.
>>>>>
>>>>> the sysfs/kernfs locking only protects against other accesses using
>>>>> sysfs.  But that's not really the most interesting part here.  We
>>>>> also want to make sure nothing changes underneath in a way that
>>>>> could cause crashes (and maybe even torn information).
>>>>>
>>>>> We'll really need to audit what is accessed in each method and figure
>>>>> out what protects it.  Chances are that sysfs_lock provides that
>>>>> protection in some cases right now, and chances are also very high
>>>>> that a lot of this is pretty broken.
>>>>>
>>>> Yes, that's possible, so I audited all the sysfs attributes that are
>>>> currently protected by q->sysfs_lock, and I found some interesting
>>>> facts. Please find them below:
>>>>
>>>> 1. io_poll:
>>>>    Writes to this attribute are ignored, so we don't need q->sysfs_lock.
>>>>
>>>> 2. io_poll_delay:
>>>>    Writes to this attribute are a no-op, so we don't need q->sysfs_lock.
>>>>
>>>> 3. io_timeout:
>>>>    Writes to this attribute update q->rq_timeout, and reads return the
>>>>    value stored in q->rq_timeout. Moreover, q->rq_timeout is set only once
>>>>    when we init the queue (in blk_mq_init_allocated_queue()), even before
>>>>    the disk is added. So we may not need to protect it with
>>>>    q->sysfs_lock.
>>>>
>>>> 4. nomerges:
>>>>    Writes to this attribute file update two queue flags: QUEUE_FLAG_NOMERGES
>>>>    and QUEUE_FLAG_NOXMERGES. These flags are accessed during bio merging,
>>>>    which in any case doesn't run with q->sysfs_lock held. Moreover, the
>>>>    queue flags are updated/accessed with atomic bitops. So I believe
>>>>    protecting them with q->sysfs_lock is not necessary.
>>>>
>>>> 5. nr_requests:
>>>>    Writes to this attribute update the tag sets, and this could
>>>>    potentially race with __blk_mq_update_nr_hw_queues(). So I think we
>>>>    should really protect it with q->tag_set->tag_list_lock instead of
>>>>    q->sysfs_lock.
>>>>
>>>> 6. read_ahead_kb:
>>>>    Writes to this attribute file update disk->bdi->ra_pages. The
>>>>    disk->bdi->ra_pages is also updated in queue_limits_commit_update(),
>>>>    which runs holding q->limits_lock; so I think this attribute file
>>>>    should be protected with q->limits_lock, and protecting it with
>>>>    q->sysfs_lock is not necessary. Maybe we should move it into the same
>>>>    set of attribute files which today run with q->limits_lock held.
>>>>
>>>> 7. rq_affinity:
>>>>    Writes to this attribute file make atomic updates to two queue flags:
>>>>    QUEUE_FLAG_SAME_COMP and QUEUE_FLAG_SAME_FORCE. These flags are also
>>>>    accessed from blk_mq_complete_need_ipi() using the test_bit macro. As
>>>>    reads/writes of the queue flags use atomic bitops, protecting them
>>>>    with q->sysfs_lock is not necessary (see the sketch after this list).
>>>>
>>>> 8. scheduler:
>>>>    Writes to this attribute actually update q->elevator, and the elevator
>>>>    change/switch code expects q->sysfs_lock to be held while we update the
>>>>    iosched, to protect against a simultaneous __blk_mq_update_nr_hw_queues()
>>>>    update. So yes, this field needs q->sysfs_lock protection.
>>>>
>>>>    However, if we're thinking of protecting the sched change/update using
>>>>    q->tag_set->tag_list_lock (as discussed in another thread), then we may
>>>>    use q->tag_set->tag_list_lock instead of q->sysfs_lock while reading/
>>>>    writing this attribute file.
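
To make points 4 and 7 above concrete, both sides reduce to atomic bitops
on q->queue_flags. A rough sketch of the pattern (not the exact
blk-sysfs.c code):

	/* writer side (sysfs store): atomic set/clear, no lock taken */
	if (val == 2) {
		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
	} else if (val == 1) {
		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
	}

	/* reader side (e.g. blk_mq_complete_need_ipi()): atomic test */
	if (!test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
		return false;

Since both sides use atomic bitops, dropping q->sysfs_lock here doesn't
open a new race; at worst a reader observes the flag value from just
before or just after the store.

Similarly, for point 6, moving read_ahead_kb under q->limits_lock could
look roughly like this (again only a sketch):

	mutex_lock(&q->limits_lock);
	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
	mutex_unlock(&q->limits_lock);

so that it serializes with queue_limits_commit_update(), which updates
ra_pages under the same lock.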
>>>
>>> This is a misuse of tag_list_lock, which is supposed to cover host-wide
>>> changes and shouldn't be used for request-queue-level protection, which
>>> is exactly what q->sysfs_lock provides.
>>>
>> Yes, I think Christoph also pointed out the same, but assuming that a
>> scheduler/elevator update is a rare operation, it may not cause much
>> contention. Having said that, I'm also fine with creating another lock
>> just to protect elevator changes and removing ->sysfs_lock from the
>> elevator code.
> 
> Then please use a new lock.
Okay, I will replace q->sysfs_lock with a dedicated lock for synchronizing
elevator switch and nr_hw_queues update; that should also eliminate the
dependency between q->q_usage_counter(io) (i.e. the freeze lock) and
q->sysfs_lock.
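
Something along the following lines, with a hypothetical name for the new
lock (just to sketch the idea, not the final patch):

	struct request_queue {
		...
		/* serializes elevator switch vs. nr_hw_queues update */
		struct mutex elevator_lock;
	};

	/*
	 * Both the elevator switch path and __blk_mq_update_nr_hw_queues()
	 * would then follow the same ordering:
	 */
	mutex_lock(&q->elevator_lock);
	blk_mq_freeze_queue(q);		/* freeze nests inside the new lock */
	/* ... switch elevator / update hw queues ... */
	blk_mq_unfreeze_queue(q);
	mutex_unlock(&q->elevator_lock);

That keeps the freeze lock strictly inside the new per-queue lock and
leaves ->tag_list_lock out of the elevator path entirely.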
> 
>>
>>> Not to mention it will cause an ABBA deadlock with the freeze lock;
>>> please see blk_mq_update_nr_hw_queues(). And it can't be used to
>>> protect 'nr_requests' either.
>> I don't see how this might cause an ABBA deadlock. The proposal here is to
>> use ->tag_list_lock (instead of ->sysfs_lock) while updating the scheduler
>> attribute from sysfs, as well as while we update the elevator through
>> __blk_mq_update_nr_hw_queues().
>>
>> In each code path (either from sysfs attribute update or from nr_hw_queues 
>> update), we first acquire ->tag_list_lock and then freeze-lock.
>>
>> Do you see any code path where the above order might not be followed?  	
> 
> Your patch 14ef49657ff3 ("block: fix nr_hw_queue update racing with disk
> addition/removal") has added one such warning: blk_mq_sysfs_unregister()
> is called after the queue freeze lock is grabbed in del_gendisk().
> 
> Also, there are many such use cases in nvme: blk_mq_quiesce_tagset()/
> blk_mq_unquiesce_tagset() are called after the tagset is frozen.
> 
> More seriously, a driver may grab ->tag_list_lock in its error recovery
> code to provide forward progress, so you have to be careful wrt. using
> ->tag_list_lock, for example:
> 
> 	mutex_lock(->tag_list_lock)
> 	blk_mq_freeze_queue()	// if an IO timeout happens, the driver's timeout
> 				// handler gets stuck on mutex_lock(->tag_list_lock)

Ok, got it! But let's wait a bit in case Christoph or others have any
further comments before I start making this change.
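
For the archive, my reading of the ABBA above as a two-context sketch:

	/* context A: elevator/sysfs path */
	mutex_lock(&set->tag_list_lock);	/* takes the tag_list_lock */
	blk_mq_freeze_queue(q);			/* waits for in-flight IO to drain */

	/*
	 * context B: driver timeout/error recovery, which must run for
	 * that in-flight IO to ever complete:
	 */
	mutex_lock(&set->tag_list_lock);	/* blocks on context A -> deadlock */

Context A holds the lock and waits on IO; context B must run to complete
that IO but blocks on the lock. Hence the new dedicated lock.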

Thanks,
--Nilay 




