On 03/21/2022, Johannes Berg wrote: > Hi, > > > > Basically, my wlan driver uses the wiphy_vendor_command ops to handle > > a number of vendor specific operations. > > > > I guess it's an out-of-tree driver, since I (hope I) fixed all the > issues in the code here ... :) > > > One of them in particular deletes > > a cfg80211 interface. > > There's quite normal API for that, why would you do that?! Yeah, unfortunately this is the code I was given. > > > The deadlock happens when thread 1 tries to take the > > RTNL lock before calling cfg80211_unregister_device() while thread 2 is > > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on > > wiphy_lock(). > > > > Here is the call flow: > > > > Thread 1: Thread 2: > > > > nl80211_pre_doit(): > > -> rtnl_lock() > > nl80211_pre_doit(): > > -> rtnl_lock() > > -> <blocked by Thread 1> > > -> wiphy_lock() > > -> rtnl_unlock() > > -> <unblock Thread 1> > > exit nl80211_pre_doit() > > <Thread 2 got the RTNL lock> > > -> wiphy_lock() > > -> <blocked by Thread 1> > > nl80211_doit() > > -> nl80211_vendor_cmd() > > -> rtnl_lock() <DEADLOCK> > > Yeah, I guess the way we invoke vendor commands now w/o RTNL held means > you cannot safely acquire RTNL in them. > > I mean, the whole above thing basically collapses down to > > Thread 1 Thread 2 > wiphy_lock(); // nl80211 > rtnl_lock(); > wiphy_lock(); > rtnl_lock(); // your driver > > The correct order to _acquire_ these is rtnl -> wiphy, and we do it that > way around everywhere (else). > > > I'm not an networking expert. So my main question is if I'm allowed to take > > the RTNL lock inside the nl80211_vendor_cmd callbacks? > > Evidently, you're not. It's interesting though, it used to be that we > called these with the RTNL held, now we don't, and the driver you're > using somehow "got fixed" to take it, but whoever fixed it didn't take > into account that this is not possible? So in my quest to upgrade the Pixel 6 kernel from 5.10 to 5.15, I noticed that I was hitting several ASSERT_RTNL() warnings during wlan testing. When I dug into those asserts, I found commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") was causing these issues. So I went about adding the necessary locks in the driver which led me to find this ABBA deadlock scenario. > > > I hope that helps explain the issue. Let me know if you need any more > > details. > > It does, but I don't think there's any way to fix it. You just > fundamentally cannot acquire the RTNL in a vendor command operation > since that introduced the ABBA deadlock you observed. > > Since it's an out-of-tree driver that's about as much as I can help. > > johannes Yeah, I understand. Thanks for the response! --Will