Re: [PATCH 1/2] mt76: mt7921: fix a leftover race in runtime-pm

Kalle Valo <kvalo@xxxxxxxxxx> · Tue, 11 Jan 2022 17:00:50 +0200

Felix Fietkau <nbd@xxxxxxxx> writes:

> On 2022-01-11 11:35, Kalle Valo wrote:
>> Lorenzo Bianconi <lorenzo@xxxxxxxxxx> writes:
>>
>>> Fix a possible race in mt7921_pm_power_save_work() if rx/tx napi
>>> schedules ps_work and we are currently accessing device register
>>> on a different cpu.
>>>
>>> Fixes: 1d8efc741df8 ("mt76: mt7921: introduce Runtime PM support")
>>> Signed-off-by: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
>>> ---
>>>  drivers/net/wireless/mediatek/mt76/mt7921/mac.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/net/wireless/mediatek/mt76/mt7921/mac.c b/drivers/net/wireless/mediatek/mt76/mt7921/mac.c
>>> index defef3496246..0744f6e42ba3 100644
>>> --- a/drivers/net/wireless/mediatek/mt76/mt7921/mac.c
>>> +++ b/drivers/net/wireless/mediatek/mt76/mt7921/mac.c
>>> @@ -1553,6 +1553,14 @@ void mt7921_pm_power_save_work(struct work_struct *work)
>>>  	    dev->fw_assert)
>>>  		goto out;
>>>  +	if (mutex_is_locked(&dev->mt76.mutex))
>>> +		/* if mt76 mutex is held we should not put the device
>>> +		 * to sleep since we are currently accessing device
>>> +		 * register map. We need to wait for the next power_save
>>> +		 * trigger.
>>> +		 */
>>> +		goto out;
>>
>> This looks fishy to me. What protects the case when ps_work is run first
>> and at the same time another cpu starts accessing the registers?
>>
>> Do note that I didn't check the code, so I might be missing something.
> For atomic context there is a locked counter pm->wake.count which is
> used to prevent the device from going to sleep. If the device is
> sleeping already on irq/tx, it is woken up and the function is
> rescheduled. The device is never put to sleep while the wake count is
> non-zero.
>
> For non-atomic context, the mutex is always held. There is a wrapper
> for acquiring and releasing the mutex, which cancels the work after
> acquiring the mutex and reschedules the delayed work after updating
> the last activity timestamp (which gets checked here after checking
> the mutex).
>
> The corner case that needs this mutex check here is when the work was
> scheduled again after processing some NAPI, tx or irq activity and the
> work gets run all while another cpu is in the middle of a long running
> non-atomic activity that holds the mutex.
>
> For that we really do need the simple mutex_is_locked check, since
> actually holding the lock here would cause a deadlock with the
> mutex_acquire wrapper.

Very good, thanks Felix and Lorenzo for explaining all this. I just
always get wary when I see mutex_is_locked() being used.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches