On Wed, 2023-11-08 at 07:07 -0800, Ben Greear wrote:
> On 11/8/23 2:31 AM, Johannes Berg wrote:
> > On Tue, 2023-11-07 at 14:08 -0800, Ben Greear wrote:
> > > Hello,
> > >
> > > I think this lockup is because iw is holding rtnl and wiphy mutex,
> > > and is blocked waiting for debugfs to be closed. Another 'cat'
> > > program has debugfs file open, and is blocking on trying to acquire
> > > wiphy mutex.
> > >
> > > I think we must not acquire wiphy mutex in debugfs methods, somehow,
> > > to resolve this deadlock. I do not know a safe way to do that.
> >
> > Hmm. I almost want to say "don't do that then", but I guess you're just
> > randomly accessing debugfs files.
> >
> > I guess we can at least make the mutex acquisition in debugfs killable
> > (or interruptible), so you can recover from this.
>
> If we can detect that the phy is going away in debugfs, then we could
> return early before attempting the lock? That would catch most things,
> I guess,

I don't think it would, it would still get locked on the mutex first.

> but still a potential race since I guess we'd have to do that check
> w/out locks. Can we do a try-mutex-lock, if not acquired, return if wiphy-going-away,
> else sleep a bit, try again?

That's kind of awful though?

And it's not just the wiphy going away, a lot of the debugfs files can
go away individually (per station, per link, per key even!). So really
what you'd need is a debugfs-level infrastructure to "send a signal to
all the things that are keeping the file open"? I suppose that could
even be done, in theory, but not in wifi by itself.

> Or, can we grab rtnl before we even open the debugfs file, like in the .open method?

Not RTNL, but rather wiphy mutex, but the question still stands - but
no, the open method has the same problem. If we acquire it there, it
still goes through the proxy fops in debugfs, so it'll still wait for it
to be done. It'll just shift the problem to another place.

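Just to make the try-lock idea from above concrete, here is a rough
userspace sketch using pthreads. It is not kernel code and not a
proposed patch: struct fake_wiphy, the going_away flag and
fake_wiphy_lock_or_bail() are all made-up names, and it assumes the
remover sets the flag before it starts waiting for open files. It also
only covers the "whole wiphy going away" case, not the per-station,
per-link or per-key files.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a wiphy: one big lock plus a teardown flag. */
struct fake_wiphy {
	pthread_mutex_t mtx;
	atomic_bool going_away;	/* assumed to be set before teardown blocks */
};

/*
 * Poll for the lock instead of sleeping on it, so a reader can notice
 * teardown and back out. Returns 0 with mtx held, or -ENODEV with mtx
 * not held once going_away has been observed.
 */
static int fake_wiphy_lock_or_bail(struct fake_wiphy *w)
{
	const struct timespec backoff = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

	assert(w);
	for (;;) {
		if (pthread_mutex_trylock(&w->mtx) == 0) {
			/* Got the lock: recheck the flag now that we hold it. */
			if (atomic_load(&w->going_away)) {
				pthread_mutex_unlock(&w->mtx);
				return -ENODEV;
			}
			return 0;
		}
		/* Lock is busy: bail out if the holder is tearing things down. */
		if (atomic_load(&w->going_away))
			return -ENODEV;
		nanosleep(&backoff, NULL);
	}
}
```

(Build with -pthread.) The polling loop is exactly the "sleep a bit, try
again" part, which is why I'd call it awful.
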

> Or can we remove the debugfs files after rtnl but before we lock the wiphy mutex
> in the destruction path?

For some maybe yes, but for a lot of them like link/sta/key removal not
really.

> I have been running similar code for...like 15 years, and haven't seen this particular
> deadlock before, so I think it is at least exacerbated by the locking changes. Or maybe
> I had particularly bad luck yesterday....

Oh, it almost certainly did get at least worse or perhaps introduced by
(a) moving everything to a single lock and (b) moving debugfs file
removal under the lock.

johannes