On 11/8/23 7:44 AM, Johannes Berg wrote:
> On Wed, 2023-11-08 at 07:07 -0800, Ben Greear wrote:
>> On 11/8/23 2:31 AM, Johannes Berg wrote:
>>> On Tue, 2023-11-07 at 14:08 -0800, Ben Greear wrote:
>>>> Hello,
>>>> I think this lockup is because iw is holding rtnl and wiphy mutex,
>>>> and is blocked waiting for debugfs to be closed. Another 'cat'
>>>> program has debugfs file open, and is blocking on trying to acquire
>>>> wiphy mutex.
>>>> I think we must not acquire wiphy mutex in debugfs methods, somehow,
>>>> to resolve this deadlock. I do not know a safe way to do that.
>>> Hmm. I almost want to say "don't do that then", but I guess you're just
>>> randomly accessing debugfs files.
>>> I guess we can at least make the mutex acquisition in debugfs killable
>>> (or interruptible), so you can recover from this.
>> If we can detect that the phy is going away in debugfs, then we could
>> return early before attempting the lock? That would catch most things,
>> I guess,
> I don't think it would, it would still get locked on the mutex first.
>> but still a potential race since I guess we'd have to do that check
>> w/out locks. Can we do a try-mutex-lock, if not acquired, return if
>> wiphy-going-away, else sleep a bit, try again?
> That's kind of awful though? And it's not just the wiphy going away, a
> lot of the debugfs files can go away individually (per station, per
> link, per key even!).
From the backtrace in the removal logic, it seems that something waits
for a debugfs file to be closed. Maybe the logic attempting to get the
mutex in debugfs can check if file is waiting to be deleted,
combined with a try-mutex-lock logic, and bail out that way?
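To make the try-lock idea concrete, here is a minimal userspace sketch using pthreads. It is not kernel code and not a proposed patch. `struct guarded_obj`, `going_away`, `lock_unless_going_away()` and `read_value()` are made-up names for illustration, and the sketch simply assumes the reader has some lock-free way to learn that removal has started, which is exactly the part that is unclear for debugfs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for an object guarded by one big lock. */
struct guarded_obj {
    pthread_mutex_t lock;
    atomic_bool going_away; /* assumed: set by the removal path before it waits */
    int value;
};

/*
 * Poll for the lock instead of sleeping on it, so a reader can notice
 * that removal has started and back out.
 * Returns 0 with the lock held, or -ENODEV without the lock.
 */
static int lock_unless_going_away(struct guarded_obj *obj)
{
    const struct timespec backoff = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

    assert(obj != NULL);
    for (;;) {
        /* Checked without the lock, so this is inherently racy. */
        if (atomic_load(&obj->going_away))
            return -ENODEV;

        if (pthread_mutex_trylock(&obj->lock) == 0) {
            /* Re-check now that we hold the lock. */
            if (atomic_load(&obj->going_away)) {
                pthread_mutex_unlock(&obj->lock);
                return -ENODEV;
            }
            return 0;
        }

        /* Lock is busy: sleep a bit, then try again. */
        nanosleep(&backoff, NULL);
    }
}

/* Reader side, playing the role of the 'cat' on a debugfs file. */
static int read_value(struct guarded_obj *obj, int *out)
{
    int ret = lock_unless_going_away(obj);

    if (ret)
        return ret;
    *out = obj->value;
    pthread_mutex_unlock(&obj->lock);
    return 0;
}
```

In this sketch the reader never blocks on the mutex itself, so a remover that holds the lock and sets the flag is not left waiting on a reader that is in turn waiting on the lock. The cost is the polling loop, and it only helps if the flag can really be observed without taking the lock.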
Thanks,
Ben
> So really what you'd need is a debugfs-level infrastructure to "send a
> signal to all the things that are keeping the file open"? I suppose that
> could even be done, in theory, but not in wifi by itself.
>> Or, can we grab rtnl before we even open the debugfs file, like in the
>> .open method?
> Not RTNL, but rather wiphy mutex, but the question still stands - but
> no, the open method has the same problem. If we acquire it there, it
> still goes through the proxy fops in debugfs, so it'll still wait for it
> to be done. It'll just shift the problem to another place.
>> Or can we remove the debugfs files after rtnl but before we lock the
>> wiphy mutex in the destruction path?
> For some maybe yes, but for a lot of them like link/sta/key removal not
> really.
>> I have been running similar code for...like 15 years, and haven't seen
>> this particular deadlock before, so I think it is at least exacerbated
>> by the locking changes. Or maybe I had particularly bad luck yesterday....
> Oh, it almost certainly did get at least worse or perhaps introduced by
> (a) moving everything to a single lock and (b) moving debugfs file
> removal under the lock.
> johannes
--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com