Re: Potentially undesirable interactions between vfork() and time namespaces

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Tue, 06 Sep 2022 17:16:05 -0500

Alexey Izbyshev <izbyshev@xxxxxxxxx> writes:

> On 2022-09-01 21:11, Eric W. Biederman wrote:
>> Andrei Vagin <avagin@xxxxxxxxx> writes:
>> 
>>> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <avagin@xxxxxxxxx> wrote:
>>>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote:
>>> <snip>
>>>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm)
>>>>>         tsk->mm->vmacache_seqnum = 0;
>>>>>         vmacache_flush(tsk);
>>>>>         task_unlock(tsk);
>>>>> +
>>>>> +       if (vfork)
>>>>> +               timens_on_fork(tsk->nsproxy, tsk);
>>>>> +
>>>>> Similarly, even after a normal vfork(), time namespace switch could 
>>>>> be
>>>>> silently skipped if the parent dies before "tsk->vfork_done" is
>>>>> read. Again,
>>>>> I don't know whether anybody cares, but this behavior seems non-obvious and
>>>>> probably unintended to me.
>>>> This is the more interesting case. I will try to find out how we can
>>>> handle it properly.
>>> It might not be a good idea to use vfork_done in this case. Let's
>>> think about what we have and what we want to change. We don't want to
>>> allow switching timens if a process mm is used by someone else. But we
>>> forgot to handle execve that creates a new mm, and we can't change this
>>> behavior right now because it can affect current users. Right?
>> What we can't changes are things that will break existing programs.  If
>> existing programs don't care we can change the behavior of the kernel.
>> 
>>> So maybe the best choice, in this case, is to change behavior by adding
>>> a new control that enables it. The first interface that comes to my mind
>>> is to introduce a new ioctl for a namespace file descriptor. Here is a
>>> draft patch below that should help to understand what I mean.
>> I don't think adding a new control works, because programs that are
>> calling vfork or posix_spawn today will stop working.
>> We should recognize that basing things off of CLONE_VFORK was a bad 
>> idea
>> as CLONE_VFORK is all about waiting for the created task to exec or
>> exit, and really has nothing to do with creating a new mm.
>> Instead I think the rule should be that a new time namespaces is
>> installed as soon as we have a new mm.
>> That will be a behavioral change if the time ns is unshared and then 
>> the
>> program exec's instead of forking children, but I suspect it is the
>> proper behavior all the same, and that existing userspace won't care.
>> Especially since all of the vfork_done work is new behavior as
>> of v6.0-rc1.
>> 
> While vfork_done work is indeed new, preservation of time_ns_for_children on
> execve() instead of switching to it is how time namespaces were originally
> implemented in 5.6. If this can be changed even now, thereby fixing the original
> design, that's great, I just want to point out that it's not the recent 6.0 work
> that is being fixed. Fixes/clarifications for man pages[1][2], which talk about
> "subsequently created children", will also be needed.
>
> [1] https://man7.org/linux/man-pages/man7/time_namespaces.7.html
> [2] https://man7.org/linux/man-pages/man2/unshare.2.html

Sorry, yes.

That is something to be double checked.

I can't see where it would make sense to unshare a time namespace and
then call exec, instead of calling exit.  So I suspect we can just
change this behavior and no one will notice. 

>> Ugh.  I just spotted another bug.  The function timens_on_fork as
>> written is not safe to call without first creating a fresh copy
>> of the nsproxy, and we don't do that during exec.  Because nsproxy
>> is shared between tasks and processes updating the values needs to
>> create a new nsproxy or other tasks/processes can be affected.
>> Not hard to handle just something that needs to be addressed.
>> Say something like this:
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 9a5ca7b82bfc..8a6947e631dd 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm)
>>  {
>>  	struct task_struct *tsk;
>>  	struct mm_struct *old_mm, *active_mm;
>> -	bool vfork;
>>  	int ret;
>>  	/* Notify parent that we're no longer interested in the old VM */
>>  	tsk = current;
>> -	vfork = !!tsk->vfork_done;
>>  	old_mm = current->mm;
>>  	exec_mm_release(tsk, old_mm);
>>  	if (old_mm)
>> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm)
>>  	vmacache_flush(tsk);
>>  	task_unlock(tsk);
>> -	if (vfork)
>> -		timens_on_fork(tsk->nsproxy, tsk);
>> -
>>  	if (old_mm) {
>>  		mmap_read_unlock(old_mm);
>>  		BUG_ON(active_mm != old_mm);
>> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm)
>>  	bprm->mm = NULL;
>> +	retval = exec_task_namespaces();
>> +	if (retval)
>> +		goto out_unlock;
>> +
>>  #ifdef CONFIG_POSIX_TIMERS
>>  	spin_lock_irq(&me->sighand->siglock);
>>  	posix_cpu_timers_exit(me);
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index cdb171efc7cb..fee881cded01 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
>>  int copy_namespaces(unsigned long flags, struct task_struct *tsk);
>>  void exit_task_namespaces(struct task_struct *tsk);
>>  void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
>> +int exec_task_namespaces(void);
>>  void free_nsproxy(struct nsproxy *ns);
>>  int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
>>  	struct cred *, struct fs_struct *);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 90c85b17bf69..b4a799d9c50f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct
>> *copy_process(
>>  			return ERR_PTR(-EINVAL);
>>  	}
>> -	/*
>> -	 * If the new process will be in a different time namespace
>> -	 * do not allow it to share VM or a thread group with the forking task.
>> -	 *
>> -	 * On vfork, the child process enters the target time namespace only
>> -	 * after exec.
>> -	 */
>> -	if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) {
>> -		if (nsp->time_ns != nsp->time_ns_for_children)
>> -			return ERR_PTR(-EINVAL);
>> -	}
>> -
>>  	if (clone_flags & CLONE_PIDFD) {
>>  		/*
>>  		 * - CLONE_DETACHED is blocked so that we can potentially
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index b4cbb406bc28..b6647846fe42 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p)
>>  	switch_task_namespaces(p, NULL);
>>  }
>> +int exec_task_namespaces(void)
>> +{
>> +	struct task_struct *tsk = current;
>> +	struct nsproxy *new;
>> +
>> +	if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
>> +		return 0;
>> +
>> +	new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
>> +	if (IS_ERR(new))
>> +		return PTR_ERR(new);
>> +
>> +	timens_on_fork(new, tsk);
>> +	switch_task_namespaces(tsk, new);
>> +	return 0;
>> +}
>> +
>> +
>>  static int check_setns_flags(unsigned long flags)
>>  {
>>  	if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> 
>> To keep things from being too confusing it probably makes sense to
>> rename the nsproxy variable from time_ns_for_children to
>> time_ns_for_new_mm.  Likewise timens_on_fork can be renamed
>> timens_on_new_mm.
>> 
> Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or will it be
> preserved for compatibility?

Unfortunately I don't think we can change that one.  We could add
another better named one, update the tools to use it.  Then wait a
couple of millenia and remove the current name.  Depending it might be
worth it, but only if you have a lot of patience.

We should get the implementation details sorted out first, and the
in-kernel name before touching the proc files.

Eric