Re: [PATCH v2] module: Don't wait for GOING modules

Petr Mladek <pmladek@xxxxxxxx> · Tue, 13 Dec 2022 11:17:42 +0100

On Mon 2022-12-12 21:09:19, Luis Chamberlain wrote:
> On Mon, Dec 05, 2022 at 11:35:57AM +0100, Petr Pavlu wrote:
> > During a system boot, it can happen that the kernel receives a burst of
> > requests to insert the same module but loading it eventually fails
> > during its init call. For instance, udev can make a request to insert
> > a frequency module for each individual CPU when another frequency module
> > is already loaded which causes the init function of the new module to
> > return an error.
> > 
> > This patch attempts a different solution for the problem 6e6de3dee51a
> > was trying to solve. Rather than waiting for the unloading to complete,
> > it returns a different error code (-EBUSY) for modules in the GOING
> > state. This should avoid the error situation that was described in
> > 6e6de3dee51a (user space attempting to load a dependent module because
> > the -EEXIST error code would suggest to user space that the first module
> > had been loaded successfully), while avoiding the delay situation too.
> > 
> 
> So sorry for the late review but these ideas only came to me as I
> drafted my pull request to Linus.
> 
> Let's recap the issue from the start as this is why I ended having
> to pause for a second and couldn't believe myself as I wrote that
> we had merged this.

IMHO, it is perfectly fine to delay this. The problem is not trivial.
It always made my head spin. There is a risk of regression. The change
spent in linux next only one week. And this particular merge window is
not good for risking because of the holiday season.

> > diff --git a/kernel/module/main.c b/kernel/module/main.c
> > index d02d39c7174e..7a627345d4fd 100644
> > --- a/kernel/module/main.c
> > +++ b/kernel/module/main.c
> > @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> >  	sched_annotate_sleep();
> >  	mutex_lock(&module_mutex);
> >  	mod = find_module_all(name, strlen(name), true);
> > -	ret = !mod || mod->state == MODULE_STATE_LIVE;
> > +	ret = !mod || mod->state == MODULE_STATE_LIVE
> > +		|| mod->state == MODULE_STATE_GOING;
> >  	mutex_unlock(&module_mutex);
> >  
> >  	return ret;
> > @@ -2562,20 +2563,35 @@ static int add_unformed_module(struct module *mod)
> >  
> >  	mod->state = MODULE_STATE_UNFORMED;
> >  
> > -again:
> 
> So this is part of my biggest concern for regression, the removal of
> this tag and its use.
> 
> Before this we always looped back to trying again and again.

Just to be sure that we are on the same page.

The loop was _not_ infinite. It serialized all attempts to load
the same module. In our case, it serialized all failures and
prolonged the pain.

> >  	mutex_lock(&module_mutex);
> >  	old = find_module_all(mod->name, strlen(mod->name), true);
> >  	if (old != NULL) {
> > -		if (old->state != MODULE_STATE_LIVE) {
> > +		if (old->state == MODULE_STATE_COMING
> > +		    || old->state == MODULE_STATE_UNFORMED) {
> >  			/* Wait in case it fails to load. */
> >  			mutex_unlock(&module_mutex);
> >  			err = wait_event_interruptible(module_wq,
> >  					       finished_loading(mod->name));
> >  			if (err)
> >  				goto out_unlocked;
> > -			goto again;
> 
> We essentially bound this now, and before we didn't.
> 
> Yes we we wait for finished_loading() of the module -- but if udev is
> hammering tons of same requests, well, we *will* surely hit this, as
> many requests managed to get in before userspace saw the module present.
> 
> While this typically can be useful, it means *quite a bit* of conditions which
> definitely *did* happen before will now *bail out* fast, to the extent
> that I'm not even sure why we just re-try once now.

I do not understand this. We do _not_ re-try the load in the new
version. We just wait for the result of the parallel attempt to
load the module.

Maybe, you are confused that we repeat find_module_all(). But it is
the way how to find the result of the parallel load.

> If we're going to 
> just re-check *once* why not do something graceful like *at least*
> cond_resched() to let the system breathe for a *tiny bit*.

We must check the result under module_mutex. We have to take this
sleeping lock. There is actually a rescheduling. I do not think that
cond_resched() would do any difference.

> > +
> > +			/* The module might have gone in the meantime. */
> > +			mutex_lock(&module_mutex);
> > +			old = find_module_all(mod->name, strlen(mod->name),
> > +					      true);
> >  		}
> > -		err = -EEXIST;
> > +
> > +		/*
> > +		 * We are here only when the same module was being loaded. Do
> > +		 * not try to load it again right now. It prevents long delays
> > +		 * caused by serialized module load failures. It might happen
> > +		 * when more devices of the same type trigger load of
> > +		 * a particular module.
> > +		 */
> > +		if (old && old->state == MODULE_STATE_LIVE)
> > +			err = -EEXIST;
> > +		else
> > +			err = -EBUSY;
> 
> And for all those use cases we end up here now, with -EBUSY. So udev
> before was not bounded, and kept busy-looping on the retry in-kernel,
> and we now immediately bound its condition to just 2 tries to see if the
> old module existed and now *return* a new value to userspace.
> 
> My main concerns are:
> 
> 0) Why not use cond_resched() if we're just going to check twice?

We take module_mutex. It should cause even bigger delay than cond_resched().

> 1) How are we sure we are not regressing userspace by removing the boundless
> loop there? (even if the endless loop was stupid)

We could not be sure. On the other hand, if more attempts help to load
the module then it is racy and not reliable. The new approach would
make it better reproducible and fix the race.

> 2) How is it we expect that we won't resgress userspace now by bounding
> that check and pretty much returning -EBUSY right away? This last part
> seems dangerous, in that if userspace did not expect -EBUSY and if an
> error before caused a module to fail and fail boot, why wouldn't we fail
> boot now by bailing out faster??

Same answer as for 1)

> 3) *Fixing* a kernel regression by adding new expected API for testing
> against -EBUSY seems not ideal.

IMHO, the right solution is to fix the subsystems so that they send
only one uevent.

The question is how the module loader would deal with "broken"
subsystems. Petr Pavlu, please, fixme. I think that there are
more subsystems doing this ugly thing.

I personally thing that returning -EBUSY is better than serializing
all the loads. It makes eventual problem easier to reproduce and fix.

Best Regards,
Petr