Hi Alexander, miquel.raynal@xxxxxxxxxxx wrote on Fri, 19 Aug 2022 19:09:44 +0200: > Hi Alexander, > > aahringo@xxxxxxxxxx wrote on Tue, 5 Jul 2022 21:23:21 -0400: > > > Hi, > > > > On Fri, Jul 1, 2022 at 10:37 AM Miquel Raynal <miquel.raynal@xxxxxxxxxxx> wrote: > > > > > > There is no need to ensure the rtnl is locked when changing a driver's > > > channel. This cause issues when scanning and this is the only driver > > > relying on it. Just drop this dependency because it does not seem > > > legitimate. > > > > > > > switching channels relies on changing pib attributes, pib attributes > > are protected by rtnl. If you experience issues here then it's > > probably because you do something wrong. All drivers assuming here > > that rtnl lock is held. > > ---8<--- > > especially this change could end in invalid free. Maybe we can solve > > this problem in a different way, what exactly is the problem by > > helding rtnl lock? > --->8--- > > During a scan we need to change channels. So when the background job > kicks-in, we first acquire scan_lock, then we check the internal > parameters of the structure protected by this lock, like the next > channel to use and the sdata pointer. A channel change must be > performed, preceded by an rtnl_lock(). This will again trigger a > possible circular lockdep dependency warning because the triggering path > acquires the rtnl (as part of the netlink layer) before the scan lock. > > One possible solution would be to do the following: > scan_work() { > acquire(scan_lock); > // do some config > release(scan_lock); > rtnl_lock(); > perform_channel_change(); > rtnl_unlock(); > acquire(scan_lock); > // reinit the scan struct ptr and the sdata ptr > // do some more things > release(scan_lock); > } > > It looks highly non-elegant IMHO. Otherwise I need to stop verifying in > the drivers that the rtnl is taken. Any third option here? I've tried two other solutions. A/ Enforcing the dependency rtnl -> scan_lock This means always acquiring the rtnl before scan_lock, and in terms of code requires to take the rtnl in the scan worker. Of course enclosing the drv_change_chan() call would mean releasing the scan_lock in the middle and re-taking it after all, which would defeat the protection of the scan_req structure which the lock is supposed to enforce. So I went for acquiring the lock at the top, before acquiring scan_lock, of course. This does not work because we need to acquire the rtnl in the worker, while at the same time there are places like ->slave_close which need to acquire the worker lock (during flush_workqueue()) and this can only happen under rtnl. Lockdep then complains about a possible circular dependency. B/ Avoiding the rtnl in scan operations and allowing the reverse dependency, which is scan_lock -> rtnl I've drafted this solution because I think the scan operation do not really need the rtnl. This idea got reinforced when I found this wireless change: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver"). But unfortunately I get the same issue again, with the ->close() implementation which needs to acquire the worker lock to flush, this makes a rtnl -> worker_lock dependency which is incompatible with a worker_lock -> scan_lock -> rtnl chain (this is is typically what should happen when changing the channel during a scan). So I looked at reducing the scope of scan_lock, in order to avoid taking it for too long and avoid the scan_lock -> rtnl or rtnl -> scan_lock dependency in the worker, but I think in the end it is a truly bad idea. Finally, I decided I could use another workqueue for the mac related commands which is not the one for the data. We don't care about flushing it because we _need_ the beacons/scan workers to be stopped, which is handled in their dedicated helpers. Doing so removes a rtnl -> worker_lock dependency, which allows to acquire the rtnl from the worker. I've mostly implemented it, I'll clean all this up and send a v2 tomorrow. Thanks, Miquèl