On Thu, Oct 10, 2024 at 07:52:20AM +1100, Aleksa Sarai wrote: > On 2024-10-08, Luca Boccassi <luca.boccassi@xxxxxxxxx> wrote: > > On Tue, 8 Oct 2024 at 14:06, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > > > On Tue, Oct 08, 2024 at 01:18:20PM GMT, luca.boccassi@xxxxxxxxx wrote: > > > > From: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > > > > > > > A common pattern when using pid fds is having to get information > > > > about the process, which currently requires /proc being mounted, > > > > resolving the fd to a pid, and then do manual string parsing of > > > > /proc/N/status and friends. This needs to be reimplemented over > > > > and over in all userspace projects (e.g.: I have reimplemented > > > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > > > requires additional care in checking that the fd is still valid > > > > after having parsed the data, to avoid races. > > > > > > > > Having a programmatic API that can be used directly removes all > > > > these requirements, including having /proc mounted. > > > > > > > > As discussed at LPC24, add an ioctl with an extensible struct > > > > so that more parameters can be added later if needed. Start with > > > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > > > optionally. > > > > > > > > Signed-off-by: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > > > --- > > > > v9: drop result_mask and reuse request_mask instead > > > > v8: use RAII guard for rcu, call put_cred() > > > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > > > of the call to avoid providing incomplete data, document what the > > > > callers should expect > > > > v5: check again that the task hasn't exited immediately before copying > > > > the result out to userspace, to ensure we are not returning stale data > > > > add an ifdef around the cgroup structs usage to fix build errors when > > > > the feature is disabled > > > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > > > v3: switch from pid_vnr() to task_pid_vnr() > > > > v2: Apply comments from Christian, apart from the one about pid namespaces > > > > as I need additional hints on how to implement it. > > > > Drop the security_context string as it is not the appropriate > > > > metadata to give userspace these days. > > > > > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > > > include/uapi/linux/pidfd.h | 30 +++++++ > > > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > > > index 80675b6bf884..15cdc7fe4968 100644 > > > > --- a/fs/pidfs.c > > > > +++ b/fs/pidfs.c > > > > @@ -2,6 +2,7 @@ > > > > #include <linux/anon_inodes.h> > > > > #include <linux/file.h> > > > > #include <linux/fs.h> > > > > +#include <linux/cgroup.h> > > > > #include <linux/magic.h> > > > > #include <linux/mount.h> > > > > #include <linux/pid.h> > > > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > > > return poll_flags; > > > > } > > > > > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > > > +{ > > > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > > > + size_t usize = _IOC_SIZE(cmd); > > > > + struct pidfd_info kinfo = {}; > > > > + struct user_namespace *user_ns; > > > > + const struct cred *c; > > > > + __u64 request_mask; > > > > + > > > > + if (!uinfo) > > > > + return -EINVAL; > > > > + if (usize < sizeof(struct pidfd_info)) > > > > + return -EINVAL; /* First version, no smaller struct possible */ > > > > + > > > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > > > + return -EFAULT; > > > > + > > > > + c = get_task_cred(task); > > > > + if (!c) > > > > + return -ESRCH; > > > > + > > > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > > > + > > > > + user_ns = current_user_ns(); > > > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > > > + put_cred(c); > > > > + > > > > +#ifdef CONFIG_CGROUPS > > > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > > > + struct cgroup *cgrp; > > > > + > > > > + guard(rcu)(); > > > > + cgrp = task_cgroup(task, pids_cgrp_id); > > > > + if (!cgrp) > > > > + return -ENODEV; > > > > > > Afaict this means that the task has already exited. In other words, the > > > cgroup id cannot be retrieved anymore for a task that has exited but not > > > been reaped. Frankly, I would have expected the cgroup id to be > > > retrievable until the task has been reaped but that's another > > > discussion. > > > > > > My point is if you contrast this with the other information in here: If > > > the task has exited but hasn't been reaped then you can still get > > > credentials such as *uid/*gid, and pid namespace relative information > > > such as pid/tgid/ppid. > > > > > > So really, I would argue that you don't want to fail this but only > > > report 0 here. That's me working under the assumption that cgroup ids > > > start from 1... > > > > > > /me checks > > > > > > Yes, they start from 1 so 0 is invalid. > > > > > > > + kinfo.cgroupid = cgroup_id(cgrp); > > > > > > Fwiw, it looks like getting the cgroup id is basically just > > > dereferencing pointers without having to hold any meaningful locks. So > > > it should be fast. So making it unconditional seems fine to me. > > > > There's an ifdef since it's an optional kernel feature, and there's > > also this thing that we might not have it, so I'd rather keep the > > flag, and set it only if we can get the information (instead of > > failing). As a user that seems clearer to me. > > I think we should keep the request_mask flag when returning from the > kernel, but it's not necessary for the user to request it explicitly > because it's cheap to get. This is how STATX_MNT_ID works and I think it > makes sense to do it that way. So what we should do is not require userspace to request it explicitly but always raise the flag in request_mask when it's available. I agree.