Re: [PATCH v2 2/2] module: Merge same-name module load requests

Petr Mladek <pmladek@xxxxxxxx> · Fri, 2 Dec 2022 17:36:12 +0100

On Tue 2022-11-29 14:13:37, Petr Pavlu wrote:
> On 11/28/22 17:29, Prarit Bhargava wrote:
> > On 11/14/22 10:45, David Hildenbrand wrote:
> >> On 14.11.22 16:38, Luis Chamberlain wrote:
> >>> On Mon, Nov 14, 2022 at 09:57:56AM +0100, David Hildenbrand wrote:
> >>>> On 12.11.22 02:47, Luis Chamberlain wrote:
> >>>>> On Wed, Oct 19, 2022 at 02:00:55PM +0200, Petr Pavlu wrote:
> >>>>>> On 10/18/22 20:33, Luis Chamberlain wrote:
> >>>>>>> On Sat, Oct 15, 2022 at 11:27:10AM +0200, Petr Pavlu wrote:
> >>>>>>>> The patch does address a regression observed after commit 
> >>>>>>>> 6e6de3dee51a
> >>>>>>>> ("kernel/module.c: Only return -EEXIST for modules that have 
> >>>>>>>> finished
> >>>>>>>> loading"). I guess it can have a Fixes tag added to the patch.
> >>>>>>>>
> >>>>>>>> I think it is hard to split this patch into parts because the 
> >>>>>>>> implemented
> >>>>>>>> "optimization" is the fix.
> >>>>>>>
> >>>>>>> git describe --contains 6e6de3dee51a
> >>>>>>> v5.3-rc1~38^2~6
> >>>>>>>
> >>>>>>> I'm a bit torn about this situation. Reverting 6e6de3dee51a would 
> >>>>>>> be the
> >>>>>>> right thing to do, but without it, it still leaves the issue reported
> >>>>>>> by Prarit Bhargava. We need a way to resolve the issue on stable and
> >>>>>>> then your optimizations can be applied on top.
> >>>>>>
> >>>>>> Simpler could be to do the following:
> >>>>>>
> >>>>>> diff --git a/kernel/module/main.c b/kernel/module/main.c
> >>>>>> index d02d39c7174e..0302ac387e93 100644
> >>>>>> --- a/kernel/module/main.c
> >>>>>> +++ b/kernel/module/main.c
> >>>>>> @@ -2386,7 +2386,8 @@ static bool finished_loading(const char *name)
> >>>>>>        sched_annotate_sleep();
> >>>>>>        mutex_lock(&module_mutex);
> >>>>>>        mod = find_module_all(name, strlen(name), true);
> >>>>>> -    ret = !mod || mod->state == MODULE_STATE_LIVE;
> >>>>>> +    ret = !mod || mod->state == MODULE_STATE_LIVE
> >>>>>> +        || mod->state == MODULE_STATE_GOING;
> >>>>>>        mutex_unlock(&module_mutex);
> >>>>>>        return ret;
> >>>>>> @@ -2566,7 +2567,8 @@ static int add_unformed_module(struct module 
> >>>>>> *mod)
> >>>>>>        mutex_lock(&module_mutex);
> >>>>>>        old = find_module_all(mod->name, strlen(mod->name), true);
> >>>>>>        if (old != NULL) {
> >>>>>> -        if (old->state != MODULE_STATE_LIVE) {
> >>>>>> +        if (old->state == MODULE_STATE_COMING
> >>>>>> +            || old->state == MODULE_STATE_UNFORMED) {
> >>>>>>                /* Wait in case it fails to load. */
> >>>>>>                mutex_unlock(&module_mutex);
> >>>>>>                err = wait_event_interruptible(module_wq,
> >>>>>> @@ -2575,7 +2577,7 @@ static int add_unformed_module(struct module 
> >>>>>> *mod)
> >>>>>>                    goto out_unlocked;
> >>>>>>                goto again;
> >>>>>>            }
> >>>>>> -        err = -EEXIST;
> >>>>>> +        err = old->state != MODULE_STATE_LIVE ? -EBUSY : -EEXIST;
> >>>>>>            goto out;
> >>>>>>        }
> >>>>>>        mod_update_bounds(mod);
> >>>>>
> >>>>>
> >>>>> Prarit, can you verify this still does not break the issue you 
> >>>>> reported?
> >>>>> David, does this also fix your issue?
> >>>>
> >>>> I didn't try, but from a quick glimpse I assume no. Allocating module 
> >>>> space
> >>>> happens before handling eventual duplicates right now, before a 
> >>>> module even
> >>>> is "alive" and in the MODULE_STATE_UNFORMED state.
> >>>
> >>> The first two hunks are a revert of commit 6e6de3dee51a and I'm under
> >>> the impression that cauased your issues as *more* modules states are
> >>> allowed through.
> >>>
> >>> The last hunk tries to fix what 6e6de3dee51a wanted to do.
> >>>
> >>
> >> Note that I don't think the issue I raised is due to 6e6de3dee51a.
> >>
> >>>> But maybe I am missing something important.
> >>>
> >>> Please do test if you can.
> >>
> >> I don't have the machine at hand right now. But, again, I doubt this 
> >> will fix it.
> >>
> >>
> >> The flow is in load_module():
> >>
> >>      mod = layout_and_allocate(info, flags);
> >>      if (IS_ERR(mod)) {
> >>          ...
> >>      }
> >>
> >>      audit_log_kern_module(mod->name);
> >>
> >>      /* Reserve our place in the list. */
> >>      err = add_unformed_module(mod);
> >>      if (err)
> >>          goto free_module;
> >>
> >>
> >> You can have 400 threads in layout_and_allocate() loading the same 
> >> module at the same time and running out of module space. Any changes to 
> >> add_unformed_module() and finished_loading() won't change that, because 
> >> they are not involved before the module space allocations happened.
> >>
> > 
> > I'd like to see a refreshed patch but I tested the latest version and 
> > see that the boot time is LONGER with the change
> > 
> > Before:
> > 
> > [11:17 AM root@intel-eaglestream-spr-15 kernel-ark]# systemd-analyze
> > Startup finished in 55.418s (firmware) + 22.766s (loader) + 35.856s 
> > (kernel) + 5.830s (initrd) + 15.671s (userspace) = 2min 15.542s
> > multi-user.target reached after 15.606s in userspace.
> > 
> > After:
> > 
> > Startup finished in 55.314s (firmware) + 23.033s (loader) + 35.331s 
> > (kernel) + 5.176s (initrd) + 23.465s (userspace) = 2min 22.320s
> > multi-user.target reached after 23.093s in userspace.
> > 
> > Subsequent reboots also indicate that userspace boot time is longer 
> > after the change.
> 
> Thanks for testing this patch, that is an interesting result.
> 
> I see the following dependency chain on my system (openSUSE Tumbleweed):
> multi-user.target -> basic.target -> sysinit.target -> systemd-udev-trigger.service.
> 
> My understanding is that the udev trigger service only performs the trigger
> operation but does not actually wait on all devices to be processed by udevd.
> In other words, handling of the forced udev events can still be in progress
> after multi-user.target is reached.
> 
> The current serialization of same-name module loads can result in many udev
> workers sleeping in add_unformed_module() and hence creating at that point
> less pressure on the CPU time from udevd. I wonder if this then maybe allows
> other work needed to reach multi-user.target to proceed faster.

Interesting theory.

Another idea. The module loader newly returns -EBUSY. I could imagine
that userspace might handle this return value a special way and try
to load the module once again.

It might make sense to test it with the updated version of the patch.
This one is racy. It returns -EBUSY only when the waiting module loader
sees the failed module loader in GOING state.

> Could you please boot the machine with 'udev.log_level=debug' and provide me
> logs ('journalctl -b -o short-monotonic') from a run with the vanilla kernel
> and with the discussed patch?

I am curious what happens here.

Best Regards,
Petr