Re: [PATCH v9] pidfd: add ioctl to retrieve pid info

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 10 Oct 2024 11:38:32 +0200

On Thu, Oct 10, 2024 at 07:52:20AM +1100, Aleksa Sarai wrote:
> On 2024-10-08, Luca Boccassi <luca.boccassi@xxxxxxxxx> wrote:
> > On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@xxxxxxxxx wrote:
> > > > From: Luca Boccassi <luca.boccassi@xxxxxxxxx>
> > > >
> > > > A common pattern when using pid fds is having to get information
> > > > about the process, which currently requires /proc being mounted,
> > > > resolving the fd to a pid, and then do manual string parsing of
> > > > /proc/N/status and friends. This needs to be reimplemented over
> > > > and over in all userspace projects (e.g.: I have reimplemented
> > > > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > > > requires additional care in checking that the fd is still valid
> > > > after having parsed the data, to avoid races.
> > > >
> > > > Having a programmatic API that can be used directly removes all
> > > > these requirements, including having /proc mounted.
> > > >
> > > > As discussed at LPC24, add an ioctl with an extensible struct
> > > > so that more parameters can be added later if needed. Start with
> > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > > > optionally.
> > > >
> > > > Signed-off-by: Luca Boccassi <luca.boccassi@xxxxxxxxx>
> > > > ---
> > > > v9: drop result_mask and reuse request_mask instead
> > > > v8: use RAII guard for rcu, call put_cred()
> > > > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> > > >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> > > >     of the call to avoid providing incomplete data, document what the
> > > >     callers should expect
> > > > v5: check again that the task hasn't exited immediately before copying
> > > >     the result out to userspace, to ensure we are not returning stale data
> > > >     add an ifdef around the cgroup structs usage to fix build errors when
> > > >     the feature is disabled
> > > > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > > > v3: switch from pid_vnr() to task_pid_vnr()
> > > > v2: Apply comments from Christian, apart from the one about pid namespaces
> > > >     as I need additional hints on how to implement it.
> > > >     Drop the security_context string as it is not the appropriate
> > > >     metadata to give userspace these days.
> > > >
> > > >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> > > >  include/uapi/linux/pidfd.h                    | 30 +++++++
> > > >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> > > >  3 files changed, 194 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > > > index 80675b6bf884..15cdc7fe4968 100644
> > > > --- a/fs/pidfs.c
> > > > +++ b/fs/pidfs.c
> > > > @@ -2,6 +2,7 @@
> > > >  #include <linux/anon_inodes.h>
> > > >  #include <linux/file.h>
> > > >  #include <linux/fs.h>
> > > > +#include <linux/cgroup.h>
> > > >  #include <linux/magic.h>
> > > >  #include <linux/mount.h>
> > > >  #include <linux/pid.h>
> > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> > > >       return poll_flags;
> > > >  }
> > > >
> > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > > > +{
> > > > +     struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > > > +     size_t usize = _IOC_SIZE(cmd);
> > > > +     struct pidfd_info kinfo = {};
> > > > +     struct user_namespace *user_ns;
> > > > +     const struct cred *c;
> > > > +     __u64 request_mask;
> > > > +
> > > > +     if (!uinfo)
> > > > +             return -EINVAL;
> > > > +     if (usize < sizeof(struct pidfd_info))
> > > > +             return -EINVAL; /* First version, no smaller struct possible */
> > > > +
> > > > +     if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > > > +             return -EFAULT;
> > > > +
> > > > +     c = get_task_cred(task);
> > > > +     if (!c)
> > > > +             return -ESRCH;
> > > > +
> > > > +     /* Unconditionally return identifiers and credentials, the rest only on request */
> > > > +
> > > > +     user_ns = current_user_ns();
> > > > +     kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > > > +     kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > > > +     kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > > > +     kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > > > +     kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > > > +     kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > > > +     kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > > > +     kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > > > +     put_cred(c);
> > > > +
> > > > +#ifdef CONFIG_CGROUPS
> > > > +     if (request_mask & PIDFD_INFO_CGROUPID) {
> > > > +             struct cgroup *cgrp;
> > > > +
> > > > +             guard(rcu)();
> > > > +             cgrp = task_cgroup(task, pids_cgrp_id);
> > > > +             if (!cgrp)
> > > > +                     return -ENODEV;
> > >
> > > Afaict this means that the task has already exited. In other words, the
> > > cgroup id cannot be retrieved anymore for a task that has exited but not
> > > been reaped. Frankly, I would have expected the cgroup id to be
> > > retrievable until the task has been reaped but that's another
> > > discussion.
> > >
> > > My point is if you contrast this with the other information in here: If
> > > the task has exited but hasn't been reaped then you can still get
> > > credentials such as *uid/*gid, and pid namespace relative information
> > > such as pid/tgid/ppid.
> > >
> > > So really, I would argue that you don't want to fail this but only
> > > report 0 here. That's me working under the assumption that cgroup ids
> > > start from 1...
> > >
> > > /me checks
> > >
> > > Yes, they start from 1 so 0 is invalid.
> > >
> > > > +             kinfo.cgroupid = cgroup_id(cgrp);
> > >
> > > Fwiw, it looks like getting the cgroup id is basically just
> > > dereferencing pointers without having to hold any meaningful locks. So
> > > it should be fast. So making it unconditional seems fine to me.
> > 
> > There's an ifdef since it's an optional kernel feature, and there's
> > also this thing that we might not have it, so I'd rather keep the
> > flag, and set it only if we can get the information (instead of
> > failing). As a user that seems clearer to me.
> 
> I think we should keep the request_mask flag when returning from the
> kernel, but it's not necessary for the user to request it explicitly
> because it's cheap to get. This is how STATX_MNT_ID works and I think it
> makes sense to do it that way.

So what we should do is not require userspace to request it explicitly
but always raise the flag in request_mask when it's available. I agree.