Re: [BUG] More issues with arm/aes-neonbs

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 8 Aug 2024 22:19:36 -0700

On Thu, 8 Aug 2024 at 21:40, Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx> wrote:
>
> The immediate cause of the recursive load is the self-test system
> (if it is not disabled through Kconfig).  The algorithm registration
> does not return until after the self-test has successfully executed.
> For the algorithm in question, that involves loading a fallback
> algorithm which is what triggered the recursive module load.

Ahh. I tried to figure out why it would load the same module
recursively, and it was very unclear to me.

> We had an issue when algorithms were built into the kernel, where
> due to the random of registration calls, a self-test may invoke
> an algorithm which is built into the kernel but not yet registered.
> There it was resolved by postponing all self-tests until all
> algorithms had been registered (or when an algorithm was first used,
> which would trigger the self-test for that algorithm there and then).

We don't have any generic module "do this asynchronously after you've
been loaded", but I guess the crypto code itself could just do
something like that when a new crypto algorithm has been registered?

The keyword being that "do this _asynchronouysly_" so that it doesn't
hold up the module init itself..

> Russell, is it OK with you if we only resolve this in the mainline
> kernel or do you want a solution that can be backported as well?

I actually had what I thought was a cunning plan, and thought that I
could fix this by reorganizing the module loading and relying on the
module_mutex itself to avoid the race that happens when you release
waiters early.

But it turns out my cunning plan was just me being stupid, because we
really can't hold the module mutex over the initcall itself, and
that's the part (well, _one_ of the parts) that needs protection.

And in fact, as part of shooting down my not-so-cunning plan I
convinced myself that I don't think this recursive load actually ever
worked at all, and it would always hang on the recursive module load.

But before that commit 9828ed3f695a, that hang  was in
module_patient_check_exists(), and it would be interruptible.

So any signal would then cause the nested module loading to break out
with an error, the first module load would finish happilt, and at
least it wouldn't hang forever.

End result: I now have a new plan - I'll make the
wait_for_completion(&idem.complete) be interruptible and return -EINTR
(and I'll have to clean up the wait-queues etc).

That should make all this work the effectively the same way the old
path in module_patient_check_exists() used to work (and still does,
for the non-file-load case).

That should be a distinct improvement, and at least get us back the
old historical behavior. It still doesn't make the recursive module
load _work_, but it won't be the "hung forever" disaster (and
regression) that it is now.

But considering how not-cunning my first plan was, I'll sleep on it
first. I think this plan actually works, but I'm not going to start
implementing it at 10pm.

And then the crypto layer fixing the actual recursion issue will fix
the underlying problem.

               Linus