[BUG] deadlock in nl80211_vendor_cmd

<willmcvicker@xxxxxxxxxx> · Thu, 17 Mar 2022 17:09:26 +0000

Hi,

I wanted to report a deadlock that I'm hitting as a result of the upstream
commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
driver"). I'm using the Pixel 6 with a downstream version of the 5.15 kernel,
but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
well.

Basically, my wlan driver uses the wiphy_vendor_command ops to handle
a number of vendor-specific operations. One of them in particular deletes
a cfg80211 interface. The deadlock happens when thread 1, which still holds
wiphy_lock() from nl80211_pre_doit(), tries to take the RTNL lock before
calling cfg80211_unregister_device() while thread 2 is inside
nl80211_pre_doit(), holding the RTNL lock and waiting on wiphy_lock().

Here is the call flow:

Thread 1:                         Thread 2:

nl80211_pre_doit():
 -> rtnl_lock()
                                     nl80211_pre_doit():
                                      -> rtnl_lock()
                                      -> <blocked by Thread 1>
 -> wiphy_lock()
 -> rtnl_unlock()
 -> <unblock Thread 2>
exit nl80211_pre_doit()
                                      <Thread 2 got the RTNL lock>
                                      -> wiphy_lock()
                                      -> <blocked by Thread 1>
nl80211_doit()
 -> nl80211_vendor_cmd()
     -> rtnl_lock() <DEADLOCK>
     -> cfg80211_unregister_device()
     -> rtnl_unlock()
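
For anyone who wants to step through the ordering, here is a minimal
userspace sketch of the call flow above. It is not kernel code: the two
locks are modelled as plain owner fields, and model_lock, model_trylock(),
model_unlock() and replay_reported_interleaving() are hypothetical names
made up for illustration. It only replays the interleaving I described and
checks that each thread ends up waiting on a lock the other one holds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: who currently owns a lock. */
enum owner { NOBODY, THREAD1, THREAD2 };

struct model_lock {
	enum owner owner;
};

/* Returns true if the lock was free and is now owned by 'who'. */
static bool model_trylock(struct model_lock *l, enum owner who)
{
	if (l->owner != NOBODY)
		return false;
	l->owner = who;
	return true;
}

static void model_unlock(struct model_lock *l, enum owner who)
{
	assert(l->owner == who);
	l->owner = NOBODY;
}

/*
 * Replays the reported interleaving step by step. Returns true when
 * thread 1 holds the wiphy lock and waits for RTNL while thread 2 holds
 * RTNL and waits for the wiphy lock (an ABBA deadlock).
 */
bool replay_reported_interleaving(void)
{
	struct model_lock rtnl = { NOBODY };
	struct model_lock wiphy = { NOBODY };
	bool ok = true;

	/* Thread 1, nl80211_pre_doit(): rtnl_lock() succeeds. */
	ok &= model_trylock(&rtnl, THREAD1);
	/* Thread 2, nl80211_pre_doit(): rtnl_lock() has to wait. */
	ok &= !model_trylock(&rtnl, THREAD2);
	/* Thread 1: wiphy_lock(), then rtnl_unlock(). */
	ok &= model_trylock(&wiphy, THREAD1);
	model_unlock(&rtnl, THREAD1);
	/* Thread 2: now gets RTNL, then has to wait on wiphy_lock(). */
	ok &= model_trylock(&rtnl, THREAD2);
	ok &= !model_trylock(&wiphy, THREAD2);
	/* Thread 1, vendor command: rtnl_lock() while holding wiphy. */
	ok &= !model_trylock(&rtnl, THREAD1);

	return ok && wiphy.owner == THREAD1 && rtnl.owner == THREAD2;
}
```

Calling replay_reported_interleaving() from a small main() should return
true, meaning neither modelled thread can make progress from that state.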

To be complete, here are the kernel call traces when the deadlock occurs:

Thread 1 Call trace:
   <Take rtnl before calling cfg80211_unregister_device()>
   nl80211_vendor_cmd+0x210/0x218
   genl_rcv_msg+0x3ac/0x45c
   netlink_rcv_skb+0x130/0x168
   genl_rcv+0x38/0x54
   netlink_unicast_kernel+0xe4/0x1f4
   netlink_unicast+0x128/0x21c
   netlink_sendmsg+0x2d8/0x3d8

Thread 2 Call trace:
   <Take wiphy_lock>
   nl80211_pre_doit+0x1b0/0x250
   genl_rcv_msg+0x37c/0x45c
   netlink_rcv_skb+0x130/0x168
   genl_rcv+0x38/0x54
   netlink_unicast_kernel+0xe4/0x1f4
   netlink_unicast+0x128/0x21c
   netlink_sendmsg+0x2d8/0x3d8

I'm not a networking expert, so my main question is whether I'm allowed to
take the RTNL lock inside the nl80211_vendor_cmd callbacks. If so, then
regardless of why I take it, we shouldn't be allowing this deadlock
situation, right?

I hope that helps explain the issue. Let me know if you need any more
details.

Thanks,
Will