On 11/8/23 2:31 AM, Johannes Berg wrote:
On Tue, 2023-11-07 at 14:08 -0800, Ben Greear wrote:
Hello,
I think this lockup is because iw is holding rtnl and the wiphy mutex,
and is blocked waiting for a debugfs file to be closed. Another 'cat'
process has that debugfs file open, and is blocked trying to acquire
the wiphy mutex.
I think we must not acquire wiphy mutex in debugfs methods, somehow,
to resolve this deadlock. I do not know a safe way to do that.
Hmm. I almost want to say "don't do that then", but I guess you're just
randomly accessing debugfs files.
I guess we can at least make the mutex acquisition in debugfs killable
(or interruptible), so you can recover from this.
If we can detect in debugfs that the phy is going away, could we
return early before attempting the lock? That would catch most cases,
I guess, but it still leaves a potential race, since we'd have to do
that check without locks. Could we do a mutex trylock instead: if the
lock is not acquired, return if the wiphy is going away, else sleep a
bit and try again?
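
Roughly what I mean by the trylock loop, as a userspace sketch with
pthreads rather than kernel primitives. The names wiphy_mtx,
wiphy_going_away and debugfs_lock_wiphy() are made up for illustration;
in the kernel this would presumably be built on mutex_trylock() and a
short sleep such as msleep(), and I have not worked out where a
going-away flag would actually live.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the wiphy mutex and a "phy is being removed" flag. */
static pthread_mutex_t wiphy_mtx = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool wiphy_going_away;

/*
 * Called from a (modelled) debugfs read handler.
 * Returns 0 with wiphy_mtx held, or -ENODEV without the lock
 * if the phy is being removed.
 */
static int debugfs_lock_wiphy(void)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

    for (;;) {
        if (pthread_mutex_trylock(&wiphy_mtx) == 0) {
            /* Re-check under the lock to narrow the unlocked-check race. */
            if (atomic_load(&wiphy_going_away)) {
                pthread_mutex_unlock(&wiphy_mtx);
                return -ENODEV;
            }
            return 0;
        }

        /* Lock is busy: give up if the remover has announced itself. */
        if (atomic_load(&wiphy_going_away))
            return -ENODEV;

        /* Otherwise sleep a bit and try again. */
        nanosleep(&delay, NULL);
    }
}
```

The point is that the reader never sleeps inside the lock call itself, so
a remover that sets the flag before waiting for readers would always see
them back out within one polling interval.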
But fundamentally this is probably not really even a new issue.
I don't know how to interrupt a specific task that's stuck in a specific
debugfs file though, e.g. when removing them.
Or, can we grab rtnl before we even open the debugfs file, like in the .open method?
Or can we remove the debugfs files after rtnl but before we lock the wiphy mutex
in the destruction path?
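
To make the ordering idea concrete, here is a small userspace model of
"remove the files first, take the mutex second". It is only a sketch:
struct phy_model, model_read() and model_remove() are invented names,
rtnl is not modelled at all, and the reader count only stands in for
debugfs waiting on in-flight users of a file.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of one phy and its debugfs file. */
struct phy_model {
    pthread_mutex_t lock;   /* stands in for the wiphy mutex */
    atomic_int readers;     /* in-flight debugfs reads */
    atomic_bool files_gone; /* set once the debugfs files are removed */
    int value;              /* state a debugfs read would report */
};

/* Modelled debugfs read: 0 on success, -1 if the file is already gone. */
static int model_read(struct phy_model *p, int *out)
{
    atomic_fetch_add(&p->readers, 1);
    if (atomic_load(&p->files_gone)) {
        atomic_fetch_sub(&p->readers, 1);
        return -1;
    }
    pthread_mutex_lock(&p->lock);
    *out = p->value;
    pthread_mutex_unlock(&p->lock);
    atomic_fetch_sub(&p->readers, 1);
    return 0;
}

/* Modelled destruction path. */
static void model_remove(struct phy_model *p)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */

    /*
     * Step 1: remove the files WITHOUT holding the lock. New readers are
     * refused, and in-flight readers can still take the lock and finish.
     */
    atomic_store(&p->files_gone, true);
    while (atomic_load(&p->readers) > 0)
        nanosleep(&delay, NULL);

    /* Step 2: only now take the lock and tear down the state. */
    pthread_mutex_lock(&p->lock);
    p->value = 0;
    pthread_mutex_unlock(&p->lock);
}
```

Because the remover is not holding the lock while it waits, a reader
blocked on the lock can always make progress, which is the property the
current ordering lacks.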
I have been running similar code for something like 15 years and haven't
seen this particular deadlock before, so I think it is at least exacerbated
by the locking changes. Or maybe I just had particularly bad luck yesterday.
Thanks,
Ben
--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com