Re: [PATCH] [RFC] rust: add PidNamespace wrapper

Alice Ryhl <aliceryhl@xxxxxxxxxx> · Fri, 27 Sep 2024 16:58:20 +0200

On Fri, Sep 27, 2024 at 4:21 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Fri, Sep 27, 2024 at 02:04:13PM GMT, Alice Ryhl wrote:
> > On Thu, Sep 26, 2024 at 6:36 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > Ok, so here's my feeble attempt at getting something going for wrapping
> > > struct pid_namespace as struct pid_namespace indirectly came up in the
> > > file abstraction thread.
> >
> > This looks great!
>
> Thanks!
>
> >
> > > The lifetime of a pid namespace is intimately tied to the lifetime of
> > > task. The pid namespace of a task doesn't ever change. A
> > > unshare(CLONE_NEWPID) or setns(fd_pidns/pidfd, CLONE_NEWPID) will not
> > > change the task's pid namespace only the pid namespace of children
> > > spawned by the task. This invariant is important to keep in mind.
> > >
> > > After a task is reaped it will be detached from its associated struct
> > > pids via __unhash_process(). This will also set task->thread_pid to
> > > NULL.
> > >
> > > In order to retrieve the pid namespace of a task task_active_pid_ns()
> > > can be used. The helper works on both current and non-current taks but
> > > the requirements are slightly different in both cases and it depends on
> > > where the helper is called.
> > >
> > > The rules for this are simple but difficult for me to translate into
> > > Rust. If task_active_pid_ns() is called on current then no RCU locking
> > > is needed as current is obviously alive. On the other hand calling
> > > task_active_pid_ns() after release_task() would work but it would mean
> > > task_active_pid_ns() will return NULL.
> > >
> > > Calling task_active_pid_ns() on a non-current task, while valid, must be
> > > under RCU or other protection mechanism as the task might be
> > > release_task() and thus in __unhash_process().
> >
> > Just to confirm, calling task_active_pid_ns() on a non-current task
> > requires the rcu lock even if you own a refcont on the task?
>
> Interesting question. Afaik, yes. task_active_pid_ns() goes via
> task->thread_pid which is a shorthand for task->pid_links[PIDTYPE_PID].
>
> This will be NULLed when the task exits and is dead (so usually when
> someone has waited on it - ignoring ptrace for sanity reasons and
> autoreaping the latter amounts to the same thing just in-kernel):
>
> T1                      T2                                                   T3
> exit(0);
>                         wait(T1)
>                         -> wait_task_zombie()
>                            -> release_task()
>                               -> __exit_signals()
>                                  -> __unash_process()
>                                     // sets task->thread_pid == NULL         task_active_pid_ns(T1)
>                                     // task->pid_links[PIDTYPE_PID] == NULL
>
> So having a reference to struct task_struct doesn't prevent
> task->thread_pid becoming NULL.
>
> And you touch upon a very interesting point. The lifetime of struct
> pid_namespace is actually tied to struct pid much tighter than it is to
> struct task_struct. So when a task is released (transitions from zombie
> to dead in the common case) the following happens:
>
> release_task()
> -> __exit_signals()
>    -> thread_pid = get_pid(task->thread_pid)
>       -> __unhash_process()
>          -> detach_pid(PIDTYPE_PID)
>             -> __change_pid()
>                {
>                        task->thread_pid = NULL;
>                        task->pid_links[PIDTYPE_PID] = NULL;
>                        free_pid(thread_pid)
>                }
>          put_pid(thread_pid)
>
> And the free_pid() in __change_pid() does a delayed_put_pid() via
> call_rcu().
>
> So afaiu, taking the rcu_read_lock() synchronizes against that
> delayed_put_pid() in __change_pid() so the call_rcu() will wait until
> everyone who does
>
> rcu_read_lock()
> task_active_pid_ns(task)
> rcu_read_unlock()
>
> and sees task->thread_pid non-NULL, is done. This way no additional
> reference count on struct task_struct or struct pid is needed before
> plucking the pid namespace from there. Does that make sense or have I
> gotten it all wrong?

Okay. I agree that the code you have is the best we can do; at least
until we get an rcu guard in Rust.

The macro doesn't quite work. You need to do something to constrain
the lifetime used by `PidNamespace::from_ptr`. Right now, there is no
constraint on the lifetime, so the caller can just pick the lifetime
'static which is the lifetime that never ends. We want to constrain it
to a lifetime that ends before the task dies. The easiest is to create
a local variable and use the lifetime of that local variable. That
way, the reference can never escape the current function, and hence,
can't escape the current task.

More generally, I'm sure there are lots of fields in current where we
can access them without rcu only because we know the current task
isn't going to die on us. I don't think we should have a macro for
every single one. I think we can put together a single macro for
getting a lifetime that ends before returning to userspace, and then
reuse that lifetime for both `current` and `current_pid_ns`, and
possibly also the `DeferredFd` patch.

Alice