On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@xxxxxxxxx wrote: > > From: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > > > A common pattern when using pid fds is having to get information > > about the process, which currently requires /proc being mounted, > > resolving the fd to a pid, and then do manual string parsing of > > /proc/N/status and friends. This needs to be reimplemented over > > and over in all userspace projects (e.g.: I have reimplemented > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > requires additional care in checking that the fd is still valid > > after having parsed the data, to avoid races. > > > > Having a programmatic API that can be used directly removes all > > these requirements, including having /proc mounted. > > > > As discussed at LPC24, add an ioctl with an extensible struct > > so that more parameters can be added later if needed. Start with > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > optionally. > > > > Signed-off-by: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > --- > > v9: drop result_mask and reuse request_mask instead > > v8: use RAII guard for rcu, call put_cred() > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > of the call to avoid providing incomplete data, document what the > > callers should expect > > v5: check again that the task hasn't exited immediately before copying > > the result out to userspace, to ensure we are not returning stale data > > add an ifdef around the cgroup structs usage to fix build errors when > > the feature is disabled > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > v3: switch from pid_vnr() to task_pid_vnr() > > v2: Apply comments from Christian, apart from the one about pid namespaces > > as I need additional hints on how to implement it. > > Drop the security_context string as it is not the appropriate > > metadata to give userspace these days. > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > include/uapi/linux/pidfd.h | 30 +++++++ > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > index 80675b6bf884..15cdc7fe4968 100644 > > --- a/fs/pidfs.c > > +++ b/fs/pidfs.c > > @@ -2,6 +2,7 @@ > > #include <linux/anon_inodes.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/cgroup.h> > > #include <linux/magic.h> > > #include <linux/mount.h> > > #include <linux/pid.h> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > return poll_flags; > > } > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > +{ > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > + size_t usize = _IOC_SIZE(cmd); > > + struct pidfd_info kinfo = {}; > > + struct user_namespace *user_ns; > > + const struct cred *c; > > + __u64 request_mask; > > + > > + if (!uinfo) > > + return -EINVAL; > > + if (usize < sizeof(struct pidfd_info)) > > + return -EINVAL; /* First version, no smaller struct possible */ > > + > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > + return -EFAULT; > > + > > + c = get_task_cred(task); > > + if (!c) > > + return -ESRCH; > > + > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > + > > + user_ns = current_user_ns(); > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > + put_cred(c); > > + > > +#ifdef CONFIG_CGROUPS > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > + struct cgroup *cgrp; > > + > > + guard(rcu)(); > > + cgrp = task_cgroup(task, pids_cgrp_id); > > + if (!cgrp) > > + return -ENODEV; > > Afaict this means that the task has already exited. In other words, the > cgroup id cannot be retrieved anymore for a task that has exited but not > been reaped. Frankly, I would have expected the cgroup id to be > retrievable until the task has been reaped but that's another > discussion. > > My point is if you contrast this with the other information in here: If > the task has exited but hasn't been reaped then you can still get > credentials such as *uid/*gid, and pid namespace relative information > such as pid/tgid/ppid. > > So really, I would argue that you don't want to fail this but only > report 0 here. That's me working under the assumption that cgroup ids > start from 1... > > /me checks > > Yes, they start from 1 so 0 is invalid. > > > + kinfo.cgroupid = cgroup_id(cgrp); > > Fwiw, it looks like getting the cgroup id is basically just > dereferencing pointers without having to hold any meaningful locks. So > it should be fast. So making it unconditional seems fine to me. There's an ifdef since it's an optional kernel feature, and there's also this thing that we might not have it, so I'd rather keep the flag, and set it only if we can get the information (instead of failing). As a user that seems clearer to me.