On Fri, 2022-03-25 at 20:36 +0000, William McVicker wrote: > > I found that my wlan driver is using the vendor commands to create/delete NAN > interfaces for this Android feature called Wi-Fi aware [1]. Basically, this > features allows users to discover other nearby devices and allows them to > connect directly with one another over a local network. > Wait, why is it doing that? We actually support a NAN interface type upstream :) It's not really quite fully fleshed out, but it could be? Probably should be? > Thread 1 Thread 2 > nl80211_pre_doit(): > rtnl_lock() > wiphy_lock() nl80211_pre_doit(): > rtnl_lock() // blocked by Thread 1 > nl80211_vendor_cmd(): > doit() > cfg80211_unregister_netdevice() > rtnl_unlock(): > netdev_run_todo(): > __rtnl_unlock() > <got RTNL lock> > wiphy_lock() // blocked by Thread 1 > rtnl_lock(); // DEADLOCK > nl80211_post_doit(): > wiphy_unlock(); Right, this is what I had discussed in my other mails. Basically, you're actually doing (some form of) unregister_netdevice() before rtnl_unlock(). Clearly this isn't possible in cfg80211 itself. However, I couldn't entirely discount the possibility that this is possible: Thread 1 Thread 2 rtnl_lock() unregister_netdevice() __rtnl_unlock() rtnl_lock() wiphy_lock() netdev_run_todo() __rtnl_unlock() // list not empty now // because of thread 2 rtnl_lock() rtnl_lock() wiphy_lock() ** DEADLOCK ** Given my other discussion with Jakub though, it seems that we can indeed make sure that this cannot happen, and then this scenario is impossible without the unregistration you're doing. > Since I'm unlocking the RTNL inside nl80211_vendor_cmd() after calling doit() > instead of waiting till post_doit(), I get into the situation you mentioned > where the net_todo_list is not empty when calling rtnl_unlock. So I decided to > drop the rtnl_unlock() in nl80211_vendor_cmd() and defer that until > nl80211_post_doit() after calling wiphy_unlock(). With this change, I haven't > been able to reproduce the deadlock. So it's possible that we aren't actually > able to hit this deadlock in nl80211_pre_doit() with the existing code since, > as you mentioned, one wouldn't be able to call unregister_netdevice() without > having the RTNL lock. > Right, this is why I said earlier that actually adding a flag for vendor commands to get the RTNL would be more complex - you'd have to basically open-code pre_doit() and post_doit() in there and check the sub-command flag at the very beginning and very end. johannes