Re: [PATCH 6/7] exec: Move most of setup_new_exec into flush_old_exec

Kees Cook <keescook@xxxxxxxxxxxx> · Tue, 5 May 2020 14:29:21 -0700

On Tue, May 05, 2020 at 02:45:33PM -0500, Eric W. Biederman wrote:
> 
> The current idiom for the callers is:
> 
> flush_old_exec(bprm);
> set_personality(...);
> setup_new_exec(bprm);
> 
> In 2010 Linus split flush_old_exec into flush_old_exec and
> setup_new_exec, with the intention that setup_new_exec be what is
> called after the process's new personality is set.
> 
> Move the code that doesn't depend upon the personality from
> setup_new_exec into flush_old_exec.  This is to facilitate future
> changes by having as much code together in one function as possible.

Er, I *think* this is okay, but I have some questions below which
maybe you already investigated (and should perhaps get called out in
the changelog).

> 
> Ref: 221af7f87b97 ("Split 'flush_old_exec' into two functions")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
> ---
>  fs/exec.c | 85 ++++++++++++++++++++++++++++---------------------------
>  1 file changed, 44 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 8c3abafb9bb1..0eff20558735 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1359,39 +1359,7 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	 * undergoing exec(2).
>  	 */
>  	do_close_on_exec(me->files);
> -	return 0;
> -
> -out_unlock:
> -	mutex_unlock(&me->signal->exec_update_mutex);
> -out:
> -	return retval;
> -}
> -EXPORT_SYMBOL(flush_old_exec);
> -
> -void would_dump(struct linux_binprm *bprm, struct file *file)
> -{
> -	struct inode *inode = file_inode(file);
> -	if (inode_permission(inode, MAY_READ) < 0) {
> -		struct user_namespace *old, *user_ns;
> -		bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
> -
> -		/* Ensure mm->user_ns contains the executable */
> -		user_ns = old = bprm->mm->user_ns;
> -		while ((user_ns != &init_user_ns) &&
> -		       !privileged_wrt_inode_uidgid(user_ns, inode))
> -			user_ns = user_ns->parent;
>  
> -		if (old != user_ns) {
> -			bprm->mm->user_ns = get_user_ns(user_ns);
> -			put_user_ns(old);
> -		}
> -	}
> -}
> -EXPORT_SYMBOL(would_dump);
> -
> -void setup_new_exec(struct linux_binprm * bprm)
> -{
> -	struct task_struct *me = current;
>  	/*
>  	 * Once here, prepare_binrpm() will not be called any more, so
>  	 * the final state of setuid/setgid/fscaps can be merged into the
> @@ -1414,8 +1382,6 @@ void setup_new_exec(struct linux_binprm * bprm)
>  			bprm->rlim_stack.rlim_cur = _STK_LIM;
>  	}
>  
> -	arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
> -
>  	me->sas_ss_sp = me->sas_ss_size = 0;
>  
>  	/*
> @@ -1430,16 +1396,9 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	else
>  		set_dumpable(current->mm, SUID_DUMP_USER);
>  
> -	arch_setup_new_exec();
>  	perf_event_exec();

What is perf expecting to be able to examine at this point? Does it want
a view of things after arch_setup_new_exec()? (i.e. "final" TIF flags,
mmap layout, etc.) From what I can tell, the answer is "no, it's just
resetting counters", so I think this is fine. Maybe double-check with
Steve?

>  	__set_task_comm(me, kbasename(bprm->filename), true);
>  
> -	/* Set the new mm task size. We have to do that late because it may
> -	 * depend on TIF_32BIT which is only updated in flush_thread() on
> -	 * some architectures like powerpc
> -	 */
> -	me->mm->task_size = TASK_SIZE;
> -
>  	/* An exec changes our domain. We are no longer part of the thread
>  	   group */
>  	WRITE_ONCE(me->self_exec_id, me->self_exec_id + 1);
> @@ -1467,6 +1426,50 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	 * credentials; any time after this it may be unlocked.
>  	 */
>  	security_bprm_committed_creds(bprm);

Similarly for the LSM hook: is it expecting a post-arch-setup view? I
don't see anything looking at task_size, TIF flags, or anything else;
the hooks seem to be just cleaning up from the old process being
replaced, so again, I think this is okay.

Not visible in this patch, the following things now happen earlier,
which I feel should maybe get called out in the changelog, with,
perhaps, better justification than what I've got here:

bprm->secureexec set/check (looks safe, since it depends on
prepare_binprm()'s security_bprm_set_creds())

rlim_stack.rlim_cur setting (safe, just needs to happen before
arch_pick_mmap_layout())

dumpable() check (looks safe, BINPRM_FLAGS_ENFORCE_NONDUMP depends on
much earlier would_dump(), and uid/gid depend on earlier calls to
prepare_binprm()'s bprm_fill_uid())

__set_task_comm (looks safe, just dealing with the task name...)

self_exec_id bump (looks safe, I think -- it's still after uid
setting)

flush_signal_handlers() (looks safe -- nothing appears to depend on mm
nor personality)
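
To make the ordering argument above easier to check, here is a rough
userspace sketch. It is not kernel code. The model_* functions,
record(), step_index() and the step strings are all hypothetical names
made up for illustration, and the step order is only what can be read
off the quoted hunks and the list above. The idea is to write down the
post-patch call order once and then assert the constraints the review
relies on, e.g. that the rlim_stack setting still precedes
arch_pick_mmap_layout().

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model only: nothing here is kernel code. Each "step" is a
 * string named after a call or assignment mentioned in the review.
 */
#define MAX_STEPS 32

static const char *trace[MAX_STEPS];
static size_t trace_len;

/* Append one step name to the trace (silently drops overflow). */
static void record(const char *step)
{
	if (trace_len < MAX_STEPS)
		trace[trace_len++] = step;
}

/* Position of a step in the trace, or -1 if it never ran. */
int step_index(const char *step)
{
	for (size_t i = 0; i < trace_len; i++)
		if (strcmp(trace[i], step) == 0)
			return (int)i;
	return -1;
}

/* Hypothetical stand-in for flush_old_exec() after the patch:
 * everything that does not depend on the personality. */
static int model_flush_old_exec(void)
{
	record("do_close_on_exec");
	record("secureexec");
	record("rlim_stack.rlim_cur");
	record("set_dumpable");
	record("perf_event_exec");
	record("__set_task_comm");
	record("self_exec_id");
	record("flush_signal_handlers");
	record("security_bprm_committed_creds");
	return 0;
}

/* Stand-in for the binfmt loader's set_personality(...) call. */
static void model_set_personality(void)
{
	record("set_personality");
}

/* Hypothetical stand-in for setup_new_exec() after the patch:
 * only the personality-dependent steps remain. */
static void model_setup_new_exec(void)
{
	record("arch_pick_mmap_layout");
	record("arch_setup_new_exec");
	record("task_size");
}

/* The caller idiom from the changelog, replayed against the model. */
void model_exec(void)
{
	trace_len = 0;
	if (model_flush_old_exec() == 0) {
		model_set_personality();
		model_setup_new_exec();
	}
}
```

After calling model_exec(), comparing step_index() values for two step
names says which one ran first in the model. That is only a statement
about this sketch; whether the real fs/exec.c matches still has to be
checked against the source.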

> +	return 0;
> +
> +out_unlock:
> +	mutex_unlock(&me->signal->exec_update_mutex);
> +out:
> +	return retval;
> +}
> +EXPORT_SYMBOL(flush_old_exec);
> +
> +void would_dump(struct linux_binprm *bprm, struct file *file)
> +{
> +	struct inode *inode = file_inode(file);
> +	if (inode_permission(inode, MAY_READ) < 0) {
> +		struct user_namespace *old, *user_ns;
> +		bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
> +
> +		/* Ensure mm->user_ns contains the executable */
> +		user_ns = old = bprm->mm->user_ns;
> +		while ((user_ns != &init_user_ns) &&
> +		       !privileged_wrt_inode_uidgid(user_ns, inode))
> +			user_ns = user_ns->parent;
> +
> +		if (old != user_ns) {
> +			bprm->mm->user_ns = get_user_ns(user_ns);
> +			put_user_ns(old);
> +		}
> +	}
> +}
> +EXPORT_SYMBOL(would_dump);

The diff helpfully decided this moved would_dump(). ;) Is it worth
maybe just moving it explicitly above flush_old_exec() to avoid this
churn? I dunno.

> +
> +void setup_new_exec(struct linux_binprm * bprm)
> +{
> +	/* Setup things that can depend upon the personality */

Should this comment be above the function instead?

> +	struct task_struct *me = current;
> +
> +	arch_pick_mmap_layout(me->mm, &bprm->rlim_stack);
> +
> +	arch_setup_new_exec();
> +
> +	/* Set the new mm task size. We have to do that late because it may
> +	 * depend on TIF_32BIT which is only updated in flush_thread() on
> +	 * some architectures like powerpc
> +	 */
> +	me->mm->task_size = TASK_SIZE;
>  	mutex_unlock(&me->signal->exec_update_mutex);
>  	mutex_unlock(&me->signal->cred_guard_mutex);
>  }
> -- 
> 2.20.1
> 

So, as I say, I *think* this is okay, but I always get suspicious about
reordering things in execve(). ;)

So, with a somewhat larger changelog discussing what's moving "earlier",
I think this looks good:

Reviewed-by: Kees Cook <keescook@xxxxxxxxxxxx>

-- 
Kees Cook