Re: [PATCH v2] module: Don't wait for GOING modules

Lucas De Marchi <lucas.demarchi@xxxxxxxxx> · Wed, 18 Jan 2023 17:15:04 -0800

On Tue, Jan 17, 2023 at 04:04:22PM -0800, Luis Chamberlain wrote:
On Tue, Dec 13, 2022 at 11:17:42AM +0100, Petr Mladek wrote:
On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
> On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
> > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > index d02d39c7174e..7a627345d4fd 100644
> > --- a/kernel/module/main.c
> > +++ b/kernel/module/main.c
> > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> >  	sched_annotate_sleep();
> >  	mutex_lock(&module_mutex);
> >  	mod = find_module_all(name, strlen(name), true);
> > -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> > +	ret = !mod || mod->state == MODULE_STATE_LIVE
> > +		|| mod->state == MODULE_STATE_GOING;
> >  	mutex_unlock(&module_mutex);
> >
> >  	return ret;
> > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
> >
> >  	mod->state = MODULE_STATE_UNFORMED;
> >
> > -again:
>
> So this is part of my biggest concern for regression, the removal of
> this tag and its use.
>
> Before this we always looped back to trying again and again.

Just to be sure that we are on the same page.

The loop was _not_ infinite. It serialized all attempts to load
the same module. In our case, it serialized all failures and
prolonged the pain.

That's fair yes. The loop happens so long as an already existing module is
present with the same name.

> >  	mutex_lock(&module_mutex);
> >  	old = find_module_all(mod->name, strlen(mod->name), true);
> >  	if (old != NULL) {
> > -		if (old->state != MODULE_STATE_LIVE) {
> > +		if (old->state == MODULE_STATE_COMING
> > +		    || old->state == MODULE_STATE_UNFORMED) {
> >  			/* Wait in case it fails to load. */
> >  			mutex_unlock(&module_mutex);
> >  			err = wait_event_interruptible(module_wq,
> >  					       finished_loading(mod->name));
> >  			if (err)
> >  				goto out_unlocked;
> > -			goto again;
>
> We essentially bound this now, and before we didn't.
>
> Yes we we wait for finished_loading() of the module -- but if udev is
> hammering tons of same requests, well, we *will* surely hit this, as
> many requests managed to get in before userspace saw the module present.
>
> While this typically can be useful, it means *quite a bit* of conditions which
> definitely *did* happen before will now *bail out* fast, to the extent
> that I'm not even sure why we just re-try once now.

I do not understand this. We do _not_ re-try the load in the new
version. We just wait for the result of the parallel attempt to
load the module.

Maybe, you are confused that we repeat find_module_all(). But it is
the way how to find the result of the parallel load.

My point is that prior to the buggy commit 6e6de3dee51a ("kernel/module.c: Only
return -EEXIST for modules that have finished loading") and even after that
commit it we 'goto again' if an old request is found. We now simply bound this
right away. Yes, the loop was not infinite, but in theory at least a few
iterations were possible before whereas now immediately return -EBUSY
and I don't think all use cases may be ready yet.

> If we're going to
> just re-check *once* why not do something graceful like *at least*
> cond_resched() to let the system breathe for a *tiny bit*.

We must check the result under module_mutex. We have to take this
sleeping lock. There is actually a rescheduling. I do not think that
cond_resched() would do any difference.

Makes sense.

> > +
> > +			/* The module might have gone in the meantime. */
> > +			mutex_lock(&module_mutex);
> > +			old = find_module_all(mod->name, strlen(mod->name),
> > +					      true);
> >  		}
> > -		err = -EEXIST;
> > +
> > +		/*
> > +		 * We are here only when the same module was being loaded. Do
> > +		 * not try to load it again right now. It prevents long delays
> > +		 * caused by serialized module load failures. It might happen
> > +		 * when more devices of the same type trigger load of
> > +		 * a particular module.
> > +		 */
> > +		if (old && old->state == MODULE_STATE_LIVE)
> > +			err = -EEXIST;
> > +		else
> > +			err = -EBUSY;
>
> And for all those use cases we end up here now, with -EBUSY. So udev
> before was not bounded, and kept busy-looping on the retry in-kernel,
> and we now immediately bound its condition to just 2 tries to see if the
> old module existed and now *return* a new value to userspace.
>
> My main concerns are:
>
> 0) Why not use cond_resched() if we're just going to check twice?

We take module_mutex. It should cause even bigger delay than cond_resched().

ACK.

> 1) How are we sure we are not regressing userspace by removing the boundless
> loop there? (even if the endless loop was stupid)

We could not be sure. On the other hand, if more attempts help to load
the module then it is racy and not reliable. The new approach would
make it better reproducible and fix the race.

Yes, but the short cut it is a userspace visible change.

> 2) How is it we expect that we won't resgress userspace now by bounding
> that check and pretty much returning -EBUSY right away? This last part
> seems dangerous, in that if userspace did not expect -EBUSY and if an
> error before caused a module to fail and fail boot, why wouldn't we fail
> boot now by bailing out faster??

Same answer as for 1)

> 3) *Fixing* a kernel regression by adding new expected API for testing
> against -EBUSY seems not ideal.

IMHO, the right solution is to fix the subsystems so that they send
only one uevent.

Makes sense, but that can take time and some folks are stuck on old kernels
and perhaps porting fixes for this on subsystems may take time to land
to some enterprisy kernels. And then there is also systemd that issues
the requests too, at least that was reflected in commit 6e6de3dee51a
("kernel/module.c: Only return -EEXIST for modules that have finished loading")
that commit claims it was systemd issueing the requests which I mean to
interpret finit_module(), not calling modprobe.

just a comment on this, if it helps anything: commit 6e6de3dee51a says
systemd, but it would be more accurate to say udev or systemd-udev to
disambiguate with pid 1.

udev uses libkmod to load modules so it's pretty much the same logic as
modprobe.

Lucas De Marchi