On 10/14/22 15:52, Petr Mladek wrote:
> On Mon 2022-09-19 14:32:33, Petr Pavlu wrote:
>> During a system boot, it can happen that the kernel receives a burst of
>> requests to insert the same module but loading it eventually fails
>> during its init call. For instance, udev can make a request to insert
>> a frequency module for each individual CPU when another frequency module
>> is already loaded, which causes the init function of the new module to
>> return an error.
>>
>> The module loader currently serializes all such requests, with the
>> barrier in add_unformed_module(). This creates a lot of unnecessary work
>> and delays the boot.
>>
>> This patch improves the behavior as follows:
>> * A check whether a module load matches an already loaded module is
>>   moved right after the module name is determined. -EEXIST continues to
>>   be returned if the module exists and is live, -EBUSY is returned if
>>   a same-name module is going.
>> * A new reference-counted shared_load_info structure is introduced to
>>   keep track of duplicate load requests. Two loads are considered
>>   equivalent if their module names match. In case a load duplicates
>>   another running insert, the code waits for its completion and then
>>   returns -EEXIST or -EBUSY depending on whether it succeeded.
>>
>> Note that prior to 6e6de3dee51a ("kernel/module.c: Only return -EEXIST
>> for modules that have finished loading"), the kernel already merged
>> some same-name load requests, but more by accident, relying on specific
>> timing. The patch brings this behavior back in a more explicit form.
>
>> --- a/kernel/module/main.c
>> +++ b/kernel/module/main.c
>> @@ -61,14 +61,29 @@
>>
>>  /*
>>   * Mutex protects:
>> - * 1) List of modules (also safely readable with preempt_disable),
>> + * 1) list of modules (also safely readable with preempt_disable, delete and add
>> + *    uses RCU list operations).
>>   * 2) module_use links,
>> - * 3) mod_tree.addr_min/mod_tree.addr_max.
>> - * (delete and add uses RCU list operations).
>> + * 3) mod_tree.addr_min/mod_tree.addr_max,
>> + * 4) list of unloaded_tainted_modules.
>
> This is unrelated and should be done in a separate patch. It minimizes
> the size of the patch and simplifies review. More importantly, these
> unrelated changes will not get lost when the patch gets reverted for
> some reason.

Ok.

>> + * 5) list of running_loads.
>>   */
>>  DEFINE_MUTEX(module_mutex);
>>  LIST_HEAD(modules);
>>
>> +/* Shared information to track duplicate module loads. */
>> +struct shared_load_info {
>> +	char name[MODULE_NAME_LEN];
>> +	refcount_t refcnt;
>> +	struct list_head list;
>> +	struct completion done;
>> +	int err;
>> +};
>> +static LIST_HEAD(running_loads);
>> +
>> +/* Waiting for a module to finish loading? */
>> +static DECLARE_WAIT_QUEUE_HEAD(module_wq);
>
> It is not obvious why this is actually needed. One would expect
> that using the completion in struct shared_load_info would
> be enough.
>
> It is needed because the user, resolve_symbol_wait(), does
> not know the exact name of the module that it is waiting for.
>
> It would deserve a comment and maybe even renaming, for example:
>
> /*
>  * Waiting for a module load when the exact module name is not known,
>  * for example, when resolving symbols from other modules.
>  */
> static DECLARE_WAIT_QUEUE_HEAD(any_module_load_wq);

I will adjust the comment.
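FWIW, the reason is visible on the waiter side. A simplified sketch of
resolve_symbol_wait() (details omitted, so not the exact mainline code)
shows that the waiter only knows the symbol name, not which module will
eventually provide it, so a per-load completion cannot be used there:

static const struct kernel_symbol *
resolve_symbol_wait(struct module *mod, const struct load_info *info,
		    const char *name)
{
	const struct kernel_symbol *ksym;
	char owner[MODULE_NAME_LEN];

	/*
	 * Re-check after every finished load until the owner of the
	 * symbol is no longer coming (or the 30s timeout expires).
	 */
	if (wait_event_interruptible_timeout(module_wq,
			!IS_ERR(ksym = resolve_symbol(mod, info, name, owner)) ||
			PTR_ERR(ksym) != -EBUSY, 30 * HZ) <= 0)
		pr_warn("%s: gave up waiting for init of module %s.\n",
			mod->name, owner);
	return ksym;
}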
>> +
>>  /* Work queue for freeing init sections in success case */
>>  static void do_free_init(struct work_struct *w);
>>  static DECLARE_WORK(init_free_wq, do_free_init);
>> @@ -762,8 +774,6 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>>  	strscpy(last_unloaded_module.taints, module_flags(mod, buf, false), sizeof(last_unloaded_module.taints));
>>
>>  	free_module(mod);
>> -	/* someone could wait for the module in add_unformed_module() */
>> -	wake_up_interruptible(&module_wq);
>
> The new code no longer wakes module_wq when the module is
> removed. I wondered whether that is correct. It is a bit tricky.
> module_wq used to have two purposes. It was used to wake up:
>
> 1. parallel loads of the same module,
>
> 2. resolving symbols in a module using symbols from
>    another module.
>
> Ad 1. The new code handles parallel loads using struct shared_load_info.
>       Also, the new code does not wait when the same module is being
>       removed. So the wake up is not needed here.
>
> Ad 2. The module_wq is typically used when the other module is
>       loaded automatically because of module dependencies.
>       In this case, only the wake up in load_module() is important.
>
> But what about module removal? IMHO, there is a small race window.
>
> The problem is that module operations are asynchronous. And A takes
> the reference on B only after it resolves a symbol, see ref_module()
> called in resolve_symbol().
>
> Let's have two modules A and B, where module A uses a symbol
> from module B:
>
> CPU 0:				CPU 1:				CPU 2:
>
> modprobe A
>   // sees dependency on B
>   // and calls modprobe B
>   // in a separate thread
>
>				modprobe B
>				  return -EEXIST
>
>								rmmod B
>								// finished
>
>   resolve_symbol_wait(sym_from_B)
>
> It has to wait until the 30s timeout because module B is gone
> and is not being loaded.
>
> IMHO, the new code makes the race window slightly bigger because
> modprobe B returns immediately even when rmmod B is already in
> progress.
>
> IMHO, this is acceptable because the race was there anyway. And
> it is not a very realistic scenario.

I'm not sure if I understand this scenario. My reading is that modprobe
of A would wait in resolve_symbol_wait() only if module B is in the
coming state, and then would later get woken up from
finalize_running_load(). In all other cases, resolve_symbol_wait()
should not sleep:
* If B is not loaded at all or is in the unformed state then
  resolve_symbol() -> find_symbol() detects that a needed symbol is
  missing. NULL is then returned from resolve_symbol() and subsequently
  from resolve_symbol_wait().
* If B is live then resolve_symbol() should successfully resolve the
  symbol and a refcount is increased on this module.
* If B is going then resolve_symbol() -> ref_module() ->
  strong_try_module_get() -> try_module_get() should fail. This then
  gets propagated as -ENODEV from resolve_symbol() and
  resolve_symbol_wait().
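To put the three cases above in one place, here is an illustrative
helper (symbol_lookup_outcome() is hypothetical, not part of the patch
or the kernel) summarizing what resolve_symbol_wait() effectively
observes depending on the state of module B at lookup time:

/*
 * Hypothetical illustration, not actual kernel code: the outcome of
 * a lookup of a symbol owned by module B, per B's state.
 */
static int symbol_lookup_outcome(const struct module *b)
{
	/*
	 * Not loaded or unformed: find_symbol() misses and
	 * resolve_symbol() returns NULL (shown here as -ENOENT).
	 */
	if (b == NULL || b->state == MODULE_STATE_UNFORMED)
		return -ENOENT;
	/* Live: the symbol resolves and a reference on B is taken. */
	if (b->state == MODULE_STATE_LIVE)
		return 0;
	/* Going: strong_try_module_get() fails, -ENODEV is propagated. */
	if (b->state == MODULE_STATE_GOING)
		return -ENODEV;
	/* Coming: the only case where resolve_symbol_wait() sleeps. */
	return -EBUSY;
}

So once rmmod B has finished, the lookup should fail fast rather than
sleep until the timeout.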
>>  	return 0;
>>  out:
>>  	mutex_unlock(&module_mutex);
>> @@ -2552,43 +2540,133 @@ static int may_init_module(void)
>>  	return 0;
>>  }
>>
>> +static struct shared_load_info *
>> +shared_load_info_alloc(const struct load_info *info)
>> +{
>> +	struct shared_load_info *shared_info =
>> +		kzalloc(sizeof(*shared_info), GFP_KERNEL);
>> +	if (shared_info == NULL)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	strscpy(shared_info->name, info->name, sizeof(shared_info->name));
>> +	refcount_set(&shared_info->refcnt, 1);
>> +	INIT_LIST_HEAD(&shared_info->list);
>> +	init_completion(&shared_info->done);
>> +	return shared_info;
>> +}
>> +
>> +static void shared_load_info_get(struct shared_load_info *shared_info)
>> +{
>> +	refcount_inc(&shared_info->refcnt);
>> +}
>> +
>> +static void shared_load_info_put(struct shared_load_info *shared_info)
>> +{
>> +	if (refcount_dec_and_test(&shared_info->refcnt))
>> +		kfree(shared_info);
>> +}
>> +
>>  /*
>> - * We try to place it in the list now to make sure it's unique before
>> - * we dedicate too many resources. In particular, temporary percpu
>> + * Check that a module load is unique and make it visible to others. The code
>> + * looks for parallel running inserts and already loaded modules. Two inserts
>> + * are considered equivalent if their module names match. In case this load
>> + * duplicates another running insert, the code waits for its completion and
>> + * then returns -EEXIST or -EBUSY depending on whether it succeeded.
>> + *
>> + * Detecting early that a load is unique avoids dedicating too many cycles and
>> + * resources to bring up the module. In particular, it prevents temporary percpu
>>   * memory exhaustion.
>> + *
>> + * Merging same-name load requests primarily helps during the boot process. It
>> + * can happen that the kernel receives a burst of requests to load the same
>> + * module (for example, the same module for each individual CPU) and loading it
>> + * eventually fails during its init call. Merging the requests ensures that only
>> + * one full attempt to load the module is made.
>> + *
>> + * On a non-error return, it is guaranteed that this load is unique.
>>   */
>> -static int add_unformed_module(struct module *mod)
>> +static struct shared_load_info *add_running_load(const struct load_info *info)
>>  {
>> -	int err;
>>  	struct module *old;
>> +	struct shared_load_info *shared_info;
>>
>> -	mod->state = MODULE_STATE_UNFORMED;
>> -
>> -again:
>>  	mutex_lock(&module_mutex);
>> -	old = find_module_all(mod->name, strlen(mod->name), true);
>> -	if (old != NULL) {
>> -		if (old->state != MODULE_STATE_LIVE) {
>> -			/* Wait in case it fails to load. */
>> +
>> +	/* Search if there is a running load of a module with the same name. */
>> +	list_for_each_entry(shared_info, &running_loads, list)
>> +		if (strcmp(shared_info->name, info->name) == 0) {
>> +			int err;
>> +
>> +			shared_load_info_get(shared_info);
>>  			mutex_unlock(&module_mutex);
>> -			err = wait_event_interruptible(module_wq,
>> -					       finished_loading(mod->name));
>> -			if (err)
>> -				goto out_unlocked;
>> -			goto again;
>> +
>> +			err = wait_for_completion_interruptible(
>> +				&shared_info->done);
>
> This would deserve a comment, for example:
>
>	/*
>	 * Return -EBUSY when the parallel load failed
>	 * for any reason. This load might end up
>	 * another way but we are not going to try.
>	 */

Ok.

>> +			if (!err)
>> +				err = shared_info->err ? -EBUSY : -EEXIST;
>> +			shared_load_info_put(shared_info);
>> +			shared_info = ERR_PTR(err);
>> +			goto out_unlocked;
>>  		}
>> -		err = -EEXIST;
>> +
>> +	/* Search if there is a live module with the given name already. */
>> +	old = find_module_all(info->name, strlen(info->name), true);
>> +	if (old != NULL) {
>> +		if (old->state == MODULE_STATE_LIVE) {
>> +			shared_info = ERR_PTR(-EEXIST);
>> +			goto out;
>> +		}
>> +
>> +		/*
>> +		 * Any active load always has its record in running_loads and so
>> +		 * would be found above. This applies independent of whether such
>> +		 * a module is currently in MODULE_STATE_UNFORMED,
>> +		 * MODULE_STATE_COMING, or even in MODULE_STATE_GOING if its
>> +		 * initialization failed. It therefore means this must be an
>> +		 * older going module and the caller should try again later once
>> +		 * it is gone.
>> +		 */
>> +		WARN_ON(old->state != MODULE_STATE_GOING);
>> +		shared_info = ERR_PTR(-EBUSY);
>>  		goto out;
>>  	}
>> -	mod_update_bounds(mod);
>> -	list_add_rcu(&mod->list, &modules);
>> -	mod_tree_insert(mod);
>> -	err = 0;
>> +
>> +	/* The load is unique, make it visible to others. */
>> +	shared_info = shared_load_info_alloc(info);
>> +	if (IS_ERR(shared_info))
>> +		goto out;
>> +	list_add(&shared_info->list, &running_loads);
>>
>>  out:
>>  	mutex_unlock(&module_mutex);
>>  out_unlocked:
>> -	return err;
>> +	return shared_info;
>> +}
>> +
>> +static void finalize_running_load(struct shared_load_info *shared_info, int err)
>> +{
>> +	/* Inform other duplicate inserts that the load finished. */
>> +	mutex_lock(&module_mutex);
>> +	list_del(&shared_info->list);
>> +	shared_info->err = err;
>> +	mutex_unlock(&module_mutex);
>> +
>> +	complete_all(&shared_info->done);
>> +	shared_load_info_put(shared_info);
>> +
>> +	/* Tell other modules waiting on this one that it completed loading. */
>> +	wake_up_interruptible(&module_wq);
>> +}
>> +
>> +static void add_unformed_module(struct module *mod)
>> +{
>> +	mod->state = MODULE_STATE_UNFORMED;
>> +
>> +	mutex_lock(&module_mutex);
>> +	mod_update_bounds(mod);
>> +	list_add_rcu(&mod->list, &modules);
>> +	mod_tree_insert(mod);
>> +	mutex_unlock(&module_mutex);
>>  }
>>
>>  static int complete_formation(struct module *mod, struct load_info *info)
>
> The comments are more or less about cosmetic problems. I do not see
> any real functional problem. I am sorry that I did not mention
> them when reviewing v1.
>
> Feel free to use:
>
> Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>

Thanks,
Petr