On Thu, 8 Aug 2024 at 10:14, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > (Please note: ENTIRELY UNTESTED! It compiles for me, but I might have > done something incredibly stupid and maybe there's some silly and > fatal bug in what _appears_ trivially correct). Ok, I fixed the stupid timeout check condition, and I actually ended up testing this on my system by making the timeout be just 10ms instead of ten seconds. With that, I get a nice module 'hid-logitech-dj.ko' taking a long time to load message about the "problem", so at least the warning seems to work. I've committed that (with the timeout for the warning set back to 10s), not because it *fixes* anything, but because the warning itself is hopefully useful to avoid having to debug issues like this in the future, and because it also re-organizes the code so that any possible "break dependency on recursion detection" thing would be easier to deal with. That said, the *proper* fix really is to make sure that a module doesn't recursively try to load itself. Because the thing that happened to make it work before the "wait until the previous module has completely finished loading" was that we *used* to do "wait until the both module has _almost_ completely finished loading, then return an error if there was another copy of it". And it turns out that the "return an error if there's a concurrent load" is fundmanetally racy if the concurrent loaders are actually concurrent (not a serial recursion). The race is typically *small*, but when I made it bigger in commit 9828ed3f695a, it actually ended up triggering in real life for modules that had dependencies and returning an error before the module had finished would then cause cascading errors. So the race does exist in real life. In the case of the actual recursive invocation, that race isn't an issue - it's all serial - and returning an error before fully initializing the first module is actually what you want (since it won't _become_ fully initialized if you wait for it). But basically the old recursion avoidance was not really reliable in other situations, which is why we don't want to re-introduce the "error out early" behavior. I don't know the crypto registration API enough to even guess at what the fix to break the recursion is. Herbert? Linus