Hi, (Jakub, can you please see below, I wonder which you prefer) > > I found that we can hit this same ABBA deadlock within the nl80211 code > before ever even calling into the vendor doit() function. Hmm. > The issue I found > is caused by the way we unlock the RTNL mutex. Here is the call flow that > leads to the deadlock: > > Thread 1 Thread 2 > nl80211_pre_doit(): > rtnl_lock() > wiphy_lock() nl80211_pre_doit(): > rtnl_lock() // blocked by Thread 1 > rtnl_unlock(): > netdev_run_todo(): > __rtnl_unlock() > <got RTNL lock> > wiphy_lock() // blocked by Thread 1 > rtnl_lock(); // DEADLOCK > doit() > nl80211_post_doit(): > wiphy_unlock(); > > Basically, unlocking the RTNL within netdev_run_todo() gives another thread > that is waiting for the RTNL in nl80211_pre_doit() a chance to grab the > RTNL lock leading to the deadlock. I found that there are multiple > instances where rtnl_lock() is called within netdev_run_todo(): a couple of > times inside netdev_wait_allrefs() and directly by netdev_run_todo(). Have you actually observed this in practice? It's true, but dynamically this only happens if you get into the while (!list_empty(&list)) { ... } code in netdev_run_todo(). Which in itself can only be true if a netdev was unregistered and netdev_run_todo() didn't run yet. Now, since normally we go through rtnl_unlock(), it's highly likely that we get into rtnl_lock() with the todo list being empty (but not impossible, read on), so then rtnl_unlock()/netdev_run_todo() won't be getting into this branch, and then the deadlock cannot happen. Now, it might be possible somewhere in the tree to unregister a netdev and then unlock using __rtnl_unlock(), so the todo list isn't processed until the next time, but __rtnl_unlock() isn't exported and all the users I found didn't seem to cause this problem. On the other hand, clearly people thought it was worth worrying about, there are already *two* almost identical implementations of avoiding this problem: - rtnl_lock_unregistering - rtnl_lock_unregistering_all (identical except for the netns list they use, partial vs. all). So we can avoid the potential deadlock in cfg80211 in a few ways: 1) export rtnl_lock_unregistering_all() or maybe a variant after refactoring the two versions, to allow cfg80211 to use it, that way netdev_run_todo() can never have a non-empty todo list 2) export __rtnl_unlock() so cfg80211 can avoid running netdev_run_todo() in the unlock, personally I like this less because it might encourage random drivers to use it 3) completely rework cfg80211's locking, adding a separate mutex for the wiphy list so we don't need to acquire the RTNL at all here (unless the ops need it, but there's no issue if we don't drop it), something like https://p.sipsolutions.net/27d08e1f5881a793.txt I think I'm happy with 3) now (even if it took a couple of hours), so I think we can go with it, just need to go through all the possibilities. johannes