Alexey Izbyshev <izbyshev@xxxxxxxxx> writes: > On 2022-09-01 21:11, Eric W. Biederman wrote: >> Andrei Vagin <avagin@xxxxxxxxx> writes: >> >>> On Tue, Aug 30, 2022 at 6:18 PM Andrei Vagin <avagin@xxxxxxxxx> wrote: >>>> On Tue, Aug 30, 2022 at 10:49:43PM +0300, Alexey Izbyshev wrote: >>> <snip> >>>>> @@ -1030,6 +1033,10 @@ static int exec_mmap(struct mm_struct *mm) >>>>> tsk->mm->vmacache_seqnum = 0; >>>>> vmacache_flush(tsk); >>>>> task_unlock(tsk); >>>>> + >>>>> + if (vfork) >>>>> + timens_on_fork(tsk->nsproxy, tsk); >>>>> + >>>>> Similarly, even after a normal vfork(), time namespace switch could >>>>> be >>>>> silently skipped if the parent dies before "tsk->vfork_done" is >>>>> read. Again, >>>>> I don't know whether anybody cares, but this behavior seems non-obvious and >>>>> probably unintended to me. >>>> This is the more interesting case. I will try to find out how we can >>>> handle it properly. >>> It might not be a good idea to use vfork_done in this case. Let's >>> think about what we have and what we want to change. We don't want to >>> allow switching timens if a process mm is used by someone else. But we >>> forgot to handle execve that creates a new mm, and we can't change this >>> behavior right now because it can affect current users. Right? >> What we can't changes are things that will break existing programs. If >> existing programs don't care we can change the behavior of the kernel. >> >>> So maybe the best choice, in this case, is to change behavior by adding >>> a new control that enables it. The first interface that comes to my mind >>> is to introduce a new ioctl for a namespace file descriptor. Here is a >>> draft patch below that should help to understand what I mean. >> I don't think adding a new control works, because programs that are >> calling vfork or posix_spawn today will stop working. >> We should recognize that basing things off of CLONE_VFORK was a bad >> idea >> as CLONE_VFORK is all about waiting for the created task to exec or >> exit, and really has nothing to do with creating a new mm. >> Instead I think the rule should be that a new time namespaces is >> installed as soon as we have a new mm. >> That will be a behavioral change if the time ns is unshared and then >> the >> program exec's instead of forking children, but I suspect it is the >> proper behavior all the same, and that existing userspace won't care. >> Especially since all of the vfork_done work is new behavior as >> of v6.0-rc1. >> > While vfork_done work is indeed new, preservation of time_ns_for_children on > execve() instead of switching to it is how time namespaces were originally > implemented in 5.6. If this can be changed even now, thereby fixing the original > design, that's great, I just want to point out that it's not the recent 6.0 work > that is being fixed. Fixes/clarifications for man pages[1][2], which talk about > "subsequently created children", will also be needed. > > [1] https://man7.org/linux/man-pages/man7/time_namespaces.7.html > [2] https://man7.org/linux/man-pages/man2/unshare.2.html Sorry, yes. That is something to be double checked. I can't see where it would make sense to unshare a time namespace and then call exec, instead of calling exit. So I suspect we can just change this behavior and no one will notice. >> Ugh. I just spotted another bug. The function timens_on_fork as >> written is not safe to call without first creating a fresh copy >> of the nsproxy, and we don't do that during exec. Because nsproxy >> is shared between tasks and processes updating the values needs to >> create a new nsproxy or other tasks/processes can be affected. >> Not hard to handle just something that needs to be addressed. >> Say something like this: >> diff --git a/fs/exec.c b/fs/exec.c >> index 9a5ca7b82bfc..8a6947e631dd 100644 >> --- a/fs/exec.c >> +++ b/fs/exec.c >> @@ -979,12 +979,10 @@ static int exec_mmap(struct mm_struct *mm) >> { >> struct task_struct *tsk; >> struct mm_struct *old_mm, *active_mm; >> - bool vfork; >> int ret; >> /* Notify parent that we're no longer interested in the old VM */ >> tsk = current; >> - vfork = !!tsk->vfork_done; >> old_mm = current->mm; >> exec_mm_release(tsk, old_mm); >> if (old_mm) >> @@ -1030,9 +1028,6 @@ static int exec_mmap(struct mm_struct *mm) >> vmacache_flush(tsk); >> task_unlock(tsk); >> - if (vfork) >> - timens_on_fork(tsk->nsproxy, tsk); >> - >> if (old_mm) { >> mmap_read_unlock(old_mm); >> BUG_ON(active_mm != old_mm); >> @@ -1303,6 +1298,10 @@ int begin_new_exec(struct linux_binprm * bprm) >> bprm->mm = NULL; >> + retval = exec_task_namespaces(); >> + if (retval) >> + goto out_unlock; >> + >> #ifdef CONFIG_POSIX_TIMERS >> spin_lock_irq(&me->sighand->siglock); >> posix_cpu_timers_exit(me); >> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h >> index cdb171efc7cb..fee881cded01 100644 >> --- a/include/linux/nsproxy.h >> +++ b/include/linux/nsproxy.h >> @@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set) >> int copy_namespaces(unsigned long flags, struct task_struct *tsk); >> void exit_task_namespaces(struct task_struct *tsk); >> void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new); >> +int exec_task_namespaces(void); >> void free_nsproxy(struct nsproxy *ns); >> int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **, >> struct cred *, struct fs_struct *); >> diff --git a/kernel/fork.c b/kernel/fork.c >> index 90c85b17bf69..b4a799d9c50f 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -2043,18 +2043,6 @@ static __latent_entropy struct task_struct >> *copy_process( >> return ERR_PTR(-EINVAL); >> } >> - /* >> - * If the new process will be in a different time namespace >> - * do not allow it to share VM or a thread group with the forking task. >> - * >> - * On vfork, the child process enters the target time namespace only >> - * after exec. >> - */ >> - if ((clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM) { >> - if (nsp->time_ns != nsp->time_ns_for_children) >> - return ERR_PTR(-EINVAL); >> - } >> - >> if (clone_flags & CLONE_PIDFD) { >> /* >> * - CLONE_DETACHED is blocked so that we can potentially >> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c >> index b4cbb406bc28..b6647846fe42 100644 >> --- a/kernel/nsproxy.c >> +++ b/kernel/nsproxy.c >> @@ -255,6 +255,24 @@ void exit_task_namespaces(struct task_struct *p) >> switch_task_namespaces(p, NULL); >> } >> +int exec_task_namespaces(void) >> +{ >> + struct task_struct *tsk = current; >> + struct nsproxy *new; >> + >> + if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns) >> + return 0; >> + >> + new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs); >> + if (IS_ERR(new)) >> + return PTR_ERR(new); >> + >> + timens_on_fork(new, tsk); >> + switch_task_namespaces(tsk, new); >> + return 0; >> +} >> + >> + >> static int check_setns_flags(unsigned long flags) >> { >> if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | >> >> To keep things from being too confusing it probably makes sense to >> rename the nsproxy variable from time_ns_for_children to >> time_ns_for_new_mm. Likewise timens_on_fork can be renamed >> timens_on_new_mm. >> > Do you imply renaming "/proc/[pid]/ns/time_for_children" as well, or will it be > preserved for compatibility? Unfortunately I don't think we can change that one. We could add another better named one, update the tools to use it. Then wait a couple of millenia and remove the current name. Depending it might be worth it, but only if you have a lot of patience. We should get the implementation details sorted out first, and the in-kernel name before touching the proc files. Eric