Search Linux Wireless

Re: [BUG] deadlock in nl80211_vendor_cmd

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 03/21/2022, Johannes Berg wrote:
> Hi,
> 
> 
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations.
> > 
> 
> I guess it's an out-of-tree driver, since I (hope I) fixed all the
> issues in the code here ... :)
> 
> > One of them in particular deletes
> > a cfg80211 interface.
> 
> There's quite normal API for that, why would you do that?!

Yeah, unfortunately this is the code I was given.

> 
> > The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> > 
> > Here is the call flow:
> > 
> > Thread 1:                         Thread 2:
> > 
> > nl80211_pre_doit():
> >   -> rtnl_lock()
> >                                       nl80211_pre_doit():
> >                                        -> rtnl_lock()
> >                                        -> <blocked by Thread 1>
> >   -> wiphy_lock()
> >   -> rtnl_unlock()
> >   -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> >                                        <Thread 2 got the RTNL lock>
> >                                        -> wiphy_lock()
> >                                        -> <blocked by Thread 1>
> > nl80211_doit()
> >   -> nl80211_vendor_cmd()
> >       -> rtnl_lock() <DEADLOCK>
> 
> Yeah, I guess the way we invoke vendor commands now w/o RTNL held means
> you cannot safely acquire RTNL in them.
> 
> I mean, the whole above thing basically collapses down to
> 
>  Thread 1                           Thread 2
>   wiphy_lock(); // nl80211
>                                      rtnl_lock();
>                                      wiphy_lock();
>   rtnl_lock();  // your driver
> 
> The correct order to _acquire_ these is rtnl -> wiphy, and we do it that
> way around everywhere (else).
> 
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks?
> 
> Evidently, you're not. It's interesting though, it used to be that we
> called these with the RTNL held, now we don't, and the driver you're
> using somehow "got fixed" to take it, but whoever fixed it didn't take
> into account that this is not possible?

So in my quest to upgrade the Pixel 6 kernel from 5.10 to 5.15, I noticed that
I was hitting several ASSERT_RTNL() warnings during wlan testing. When I dug
into those asserts, I found commit a05829a7222e ("cfg80211: avoid holding the
RTNL when calling the driver") was causing these issues. So I went about adding
the necessary locks in the driver which led me to find this ABBA deadlock
scenario.

> 
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
> 
> It does, but I don't think there's any way to fix it. You just
> fundamentally cannot acquire the RTNL in a vendor command operation
> since that introduced the ABBA deadlock you observed.
> 
> Since it's an out-of-tree driver that's about as much as I can help.
> 
> johannes

Yeah, I understand.  Thanks for the response!

--Will



[Index of Archives]     [Linux Host AP]     [ATH6KL]     [Linux Wireless Personal Area Network]     [Linux Bluetooth]     [Wireless Regulations]     [Linux Netdev]     [Kernel Newbies]     [Linux Kernel]     [IDE]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite Hiking]     [MIPS Linux]     [ARM Linux]     [Linux RAID]

  Powered by Linux