On Thu, 8 Aug 2024 at 21:40, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote: > > The immediate cause of the recursive load is the self-test system > (if it is not disabled through Kconfig). The algorithm registration > does not return until after the self-test has successfully executed. > For the algorithm in question, that involves loading a fallback > algorithm which is what triggered the recursive module load. Ahh. I tried to figure out why it would load the same module recursively, and it was very unclear to me. > We had an issue when algorithms were built into the kernel, where > due to the random of registration calls, a self-test may invoke > an algorithm which is built into the kernel but not yet registered. > There it was resolved by postponing all self-tests until all > algorithms had been registered (or when an algorithm was first used, > which would trigger the self-test for that algorithm there and then). We don't have any generic module "do this asynchronously after you've been loaded", but I guess the crypto code itself could just do something like that when a new crypto algorithm has been registered? The keyword being that "do this _asynchronouysly_" so that it doesn't hold up the module init itself.. > Russell, is it OK with you if we only resolve this in the mainline > kernel or do you want a solution that can be backported as well? I actually had what I thought was a cunning plan, and thought that I could fix this by reorganizing the module loading and relying on the module_mutex itself to avoid the race that happens when you release waiters early. But it turns out my cunning plan was just me being stupid, because we really can't hold the module mutex over the initcall itself, and that's the part (well, _one_ of the parts) that needs protection. And in fact, as part of shooting down my not-so-cunning plan I convinced myself that I don't think this recursive load actually ever worked at all, and it would always hang on the recursive module load. But before that commit 9828ed3f695a, that hang was in module_patient_check_exists(), and it would be interruptible. So any signal would then cause the nested module loading to break out with an error, the first module load would finish happilt, and at least it wouldn't hang forever. End result: I now have a new plan - I'll make the wait_for_completion(&idem.complete) be interruptible and return -EINTR (and I'll have to clean up the wait-queues etc). That should make all this work the effectively the same way the old path in module_patient_check_exists() used to work (and still does, for the non-file-load case). That should be a distinct improvement, and at least get us back the old historical behavior. It still doesn't make the recursive module load _work_, but it won't be the "hung forever" disaster (and regression) that it is now. But considering how not-cunning my first plan was, I'll sleep on it first. I think this plan actually works, but I'm not going to start implementing it at 10pm. And then the crypto layer fixing the actual recursion issue will fix the underlying problem. Linus