On 03/21/2022, Will McVicker wrote: > On Thu, Mar 17, 2022 at 10:09 AM <willmcvicker@xxxxxxxxxx> wrote: > > > Hi, > > > > I wanted to report a deadlock that I'm hitting as a result of the upstream > > commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the > > driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel, > > but I'm pretty sure this will happen on the upstream tip-of-tree kernel as > > well. > > > > Basically, my wlan driver uses the wiphy_vendor_command ops to handle > > a number of vendor specific operations. One of them in particular deletes > > a cfg80211 interface. The deadlock happens when thread 1 tries to take the > > RTNL lock before calling cfg80211_unregister_device() while thread 2 is > > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on > > wiphy_lock(). > > > > Here is the call flow: > > > > Thread 1: Thread 2: > > > > nl80211_pre_doit(): > > -> rtnl_lock() > > nl80211_pre_doit(): > > -> rtnl_lock() > > -> <blocked by Thread 1> > > -> wiphy_lock() > > -> rtnl_unlock() > > -> <unblock Thread 1> > > exit nl80211_pre_doit() > > <Thread 2 got the RTNL lock> > > -> wiphy_lock() > > -> <blocked by Thread 1> > > nl80211_doit() > > -> nl80211_vendor_cmd() > > -> rtnl_lock() <DEADLOCK> > > -> cfg80211_unregister_device() > > -> rtnl_unlock() > > > > > > To be complete, here are the kernel call traces when the deadlock occurs: > > > > Thread 1 Call trace: > > <Take rtnl before calling cfg80211_unregister_device()> > > nl80211_vendor_cmd+0x210/0x218 > > genl_rcv_msg+0x3ac/0x45c > > netlink_rcv_skb+0x130/0x168 > > genl_rcv+0x38/0x54 > > netlink_unicast_kernel+0xe4/0x1f4 > > netlink_unicast+0x128/0x21c > > netlink_sendmsg+0x2d8/0x3d8 > > > > Thread 2 Call trace: > > <Take wiphy_lock> > > nl80211_pre_doit+0x1b0/0x250 > > genl_rcv_msg+0x37c/0x45c > > netlink_rcv_skb+0x130/0x168 > > genl_rcv+0x38/0x54 > > netlink_unicast_kernel+0xe4/0x1f4 > > netlink_unicast+0x128/0x21c > > netlink_sendmsg+0x2d8/0x3d8 > > > > I'm not an networking expert. So my main question is if I'm allowed to take > > the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then > > regardless of why I take it, we shouldn't be allowing this deadlock > > situation, right? > > > > I hope that helps explain the issue. Let me know if you need any more > > details. > > > > Thanks, > > Will > > > > Sorry my CC list got dropped. Adding the following: > > Kalle Valo <kvalo@xxxxxxxxxxxxxx> > "David S. Miller" <davem@xxxxxxxxxxxxx> > Jakub Kicinski <kuba@xxxxxxxxxx> > netdev@xxxxxxxxxxxxxxx > Amitkumar Karwar <amitkarwar@xxxxxxxxx> > Ganapathi Bhat <ganapathi.bhat@xxxxxxx> > Xinming Hu <huxinming820@xxxxxxxxx> > kernel-team@xxxxxxxxxxx Sorry for the noise. The lists bounced due to html. Resending with mutt to make sure everyone gets this message. As an update, I was able to fix the deadlock by updating nl80211_pre_doit() to not hold the RTNL lock while waiting to get the wiphy_lock. This allows us to take the RTNL lock within nl80211_doit() and have parallel calls to nl80211_doit(). Below is the logic I tested. Please let me know if I'm heading in the right direction. Thanks, Will diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c index 686a69381731..bb4ad746509b 100644 --- a/net/wireless/nl80211.c +++ b/net/wireless/nl80211.c @@ -15227,7 +15227,24 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb, } if (rdev && !(ops->internal_flags & NL80211_FLAG_NO_WIPHY_MTX)) { - wiphy_lock(&rdev->wiphy); + while (!mutex_trylock(&rdev->wiphy.mtx)) { + /* Holding the RTNL lock while waiting for the wiphy lock can lead to + * a deadlock within doit() ops that don't hold the RTNL in pre_doit. So + * we need to release the RTNL lock first while we wait for the wiphy + * lock. + */ + rtnl_unlock(); + wiphy_lock(&rdev->wiphy); + + /* Once we get the wiphy_lock, we need to grab the RTNL lock. If we can't + * get it, then we need to unlock the wiphy to avoid a deadlock in + * pre_doit and then retry taking the locks again. */ + if (!rtnl_trylock()) { + wiphy_unlock(&rdev->wiphy); + rtnl_lock(); + } else + break; + } /* we keep the mutex locked until post_doit */ __release(&rdev->wiphy.mtx); }