On 10/18/23 12:39 PM, Yoshihiro Shimoda wrote: [...] >>> Fix races between ravb_tx_timeout_work() and functions of net_device_ops >>> and ethtool_ops by using rtnl_trylock() and rtnl_unlock(). Note that >>> since ravb_close() is under the rtnl lock and calls cancel_work_sync(), >>> ravb_tx_timeout_work() calls rtnl_trylock() to avoid a deadlock. >> >> I don't quite follow... how calling cancel_work_sync() is a problem? >> I thought the problem was that unregister_netdev() can be called with >> the TX timeout work still pending? And, more generally, shouldn't we >> protect against the TX timeout work being executed on a different CPU >> than the {net_device|ethtool}_ops methods are being executed (is that >> possible?)? > > __dev_close_many() in net/core/dev.c calls ASSERT_RTNL() and ops->ndo_stop(). > So, the ravb_close() is under rtnl lock. While locking the rtnl, it's > possible to call ravb_tx_timeout_work() on other CPU. In such a case, > it's possible to cause a deadlock between ravb_close() and ravb_tx_timeout_work() > > CPU0 CPU1 > ravb_tx_timeout() > schedule_work() > ... > __dev_close_many() > // this is under rtnl lock > ravb_close() > cancel_work_sync() > ravb_tx_timeout_work() > rtnl_lock() > // this is possible to cause a deadlock Ah, cancel_work_sync() means we have to wait for the work to finish -- indeed a deadlock is possiblet then. >> I also had a suspicion that we still miss the spinlock calls in >> ravb_tx_timeout_work() but nothing in particular jumped at me... We mainly need to protect against the interrupts in this case... >> mind looking into that? :-) > > Yes, perhaps we have to check it somehow... Unfortunately, I don't seem to have no bandwidth to do that myself... [...] MBR, Sergey