Re: [PATCH net] ravb: Fix races between ravb_tx_timeout_work() and net related ops

Sergey Shtylyov <s.shtylyov@xxxxxx> · Wed, 18 Oct 2023 20:27:13 +0300

On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote:
[...]
>>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops
>>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that
>>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(),
>>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock.
>>
>>    I don't quite follow... how calling cancel_work_sync() is a problem?
>> I thought the problem was that unregister_netdev() can be called with
>> the TX timeout work still pending? And, more generally, shouldn't we
>> protect against the TX timeout work being executed on a different CPU
>> than the {net_device|ethtool}_ops methods are being executed (is that
>> possible?)?
> 
> __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop().
> So, the ravb_close() is under rtnl lock. While locking the rtnl, it's
> possible to call ravb_tx_timeout_work() on other CPU. In such a case,
> it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work()
> 
> CPU0				CPU1
> 				ravb_tx_timeout()
> 				schedule_work()
> ...
> __dev_close_many()
> // this is under rtnl lock
> ravb_close()
> cancel_work_sync()
> 				ravb_tx_timeout_work()
> 				rtnl_lock()
> 				// this is possible to cause a deadlock

   Ah, cancel_work_sync() means we have to wait for the work to
finish -- indeed a deadlock is possiblet then. 
>>    I also had a suspicion that we still miss the spinlock calls in
>> ravb_tx_timeout_work() but nothing in particular jumped at me...

   We mainly need to protect against the interrupts in this case...

>> mind looking into that? :-)
> 
> Yes, perhaps we have to check it somehow...

   Unfortunately, I don't seem to have no bandwidth to do that myself...

[...]

MBR, Sergey