On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote: > On 2024-10-08, luca.boccassi@xxxxxxxxx <luca.boccassi@xxxxxxxxx> wrote: > > From: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > > > A common pattern when using pid fds is having to get information > > about the process, which currently requires /proc being mounted, > > resolving the fd to a pid, and then do manual string parsing of > > /proc/N/status and friends. This needs to be reimplemented over > > and over in all userspace projects (e.g.: I have reimplemented > > resolving in systemd, dbus, dbus-daemon, polkit so far), and > > requires additional care in checking that the fd is still valid > > after having parsed the data, to avoid races. > > > > Having a programmatic API that can be used directly removes all > > these requirements, including having /proc mounted. > > > > As discussed at LPC24, add an ioctl with an extensible struct > > so that more parameters can be added later if needed. Start with > > returning pid/tgid/ppid and creds unconditionally, and cgroupid > > optionally. > > > > Signed-off-by: Luca Boccassi <luca.boccassi@xxxxxxxxx> > > --- > > v9: drop result_mask and reuse request_mask instead > > v8: use RAII guard for rcu, call put_cred() > > v7: fix RCU issue and style issue introduced by v6 found by reviewer > > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to > > get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end > > of the call to avoid providing incomplete data, document what the > > callers should expect > > v5: check again that the task hasn't exited immediately before copying > > the result out to userspace, to ensure we are not returning stale data > > add an ifdef around the cgroup structs usage to fix build errors when > > the feature is disabled > > v4: fix arg check in pidfd_ioctl() by moving it after the new call > > v3: switch from pid_vnr() to task_pid_vnr() > > v2: Apply comments from Christian, apart from the one about pid namespaces > > as I need additional hints on how to implement it. > > Drop the security_context string as it is not the appropriate > > metadata to give userspace these days. > > > > fs/pidfs.c | 88 ++++++++++++++++++- > > include/uapi/linux/pidfd.h | 30 +++++++ > > .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++- > > 3 files changed, 194 insertions(+), 4 deletions(-) > > > > diff --git a/fs/pidfs.c b/fs/pidfs.c > > index 80675b6bf884..15cdc7fe4968 100644 > > --- a/fs/pidfs.c > > +++ b/fs/pidfs.c > > @@ -2,6 +2,7 @@ > > #include <linux/anon_inodes.h> > > #include <linux/file.h> > > #include <linux/fs.h> > > +#include <linux/cgroup.h> > > #include <linux/magic.h> > > #include <linux/mount.h> > > #include <linux/pid.h> > > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts) > > return poll_flags; > > } > > > > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg) > > +{ > > + struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg; > > + size_t usize = _IOC_SIZE(cmd); > > + struct pidfd_info kinfo = {}; > > + struct user_namespace *user_ns; > > + const struct cred *c; > > + __u64 request_mask; > > + > > + if (!uinfo) > > + return -EINVAL; > > + if (usize < sizeof(struct pidfd_info)) > > + return -EINVAL; /* First version, no smaller struct possible */ > > + > > + if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask))) > > + return -EFAULT; > > + > > + c = get_task_cred(task); > > + if (!c) > > + return -ESRCH; > > + > > + /* Unconditionally return identifiers and credentials, the rest only on request */ > > + > > + user_ns = current_user_ns(); > > + kinfo.ruid = from_kuid_munged(user_ns, c->uid); > > + kinfo.rgid = from_kgid_munged(user_ns, c->gid); > > + kinfo.euid = from_kuid_munged(user_ns, c->euid); > > + kinfo.egid = from_kgid_munged(user_ns, c->egid); > > + kinfo.suid = from_kuid_munged(user_ns, c->suid); > > + kinfo.sgid = from_kgid_munged(user_ns, c->sgid); > > + kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid); > > + kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid); > > + put_cred(c); > > + > > +#ifdef CONFIG_CGROUPS > > + if (request_mask & PIDFD_INFO_CGROUPID) { > > + struct cgroup *cgrp; > > + > > + guard(rcu)(); > > + cgrp = task_cgroup(task, pids_cgrp_id); > > + if (!cgrp) > > + return -ENODEV; > > + kinfo.cgroupid = cgroup_id(cgrp); > > + > > + kinfo.request_mask |= PIDFD_INFO_CGROUPID; > > + } > > +#endif > > + > > + /* > > + * Copy pid/tgid last, to reduce the chances the information might be > > + * stale. Note that it is not possible to ensure it will be valid as the > > + * task might return as soon as the copy_to_user finishes, but that's ok > > + * and userspace expects that might happen and can act accordingly, so > > + * this is just best-effort. What we can do however is checking that all > > + * the fields are set correctly, or return ESRCH to avoid providing > > + * incomplete information. */ > > + > > + kinfo.ppid = task_ppid_nr_ns(task, NULL); > > + kinfo.tgid = task_tgid_vnr(task); > > + kinfo.pid = task_pid_vnr(task); > > + > > + if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1)) > > + return -ESRCH; > > + > > + /* > > + * If userspace and the kernel have the same struct size it can just > > + * be copied. If userspace provides an older struct, only the bits that > > + * userspace knows about will be copied. If userspace provides a new > > + * struct, only the bits that the kernel knows about will be copied and > > + * the size value will be set to the size the kernel knows about. > > + */ > > + if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo)))) > > + return -EFAULT; > > If usize > ksize, we also want to clear_user() the trailing bytes to > avoid userspace thinking that any garbage bytes they had are valid. > > Also, you mention "the size value" but there is no size in pidfd_info. I > don't think it's actually necessary to include such a field (especially > when you have a statx-like request_mask), but it means you really should > clear the trailing bytes to avoid userspace bugs. > > I implemented all of these semantics as copy_struct_to_user() in the > CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you > can cherry-pick this patch and use it? The semantics when we extend this > pidfd_info to accept new request_mask values with larger structures is > going to get a little ugly and copy_struct_to_user() makes this a little > easier to deal with. > > [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@xxxxxxxxxx/ I agree. @Luca, you can either send the two patches together or I can just port the patch to it. I don't mind. > > > + > > + return 0; > > +} > > + > > static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > > { > > struct task_struct *task __free(put_task) = NULL; > > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > > struct ns_common *ns_common = NULL; > > struct pid_namespace *pid_ns; > > > > - if (arg) > > - return -EINVAL; > > - > > task = get_pid_task(pid, PIDTYPE_PID); > > if (!task) > > return -ESRCH; > > > > + /* Extensible IOCTL that does not open namespace FDs, take a shortcut */ > > + if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO)) > > + return pidfd_info(task, cmd, arg); > > + > > + if (arg) > > + return -EINVAL; > > + > > scoped_guard(task_lock, task) { > > nsp = task->nsproxy; > > if (nsp) > > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h > > index 565fc0629fff..d685eeeedc51 100644 > > --- a/include/uapi/linux/pidfd.h > > +++ b/include/uapi/linux/pidfd.h > > @@ -16,6 +16,35 @@ > > #define PIDFD_SIGNAL_THREAD_GROUP (1UL << 1) > > #define PIDFD_SIGNAL_PROCESS_GROUP (1UL << 2) > > > > +/* Flags for pidfd_info. */ > > +#define PIDFD_INFO_CGROUPID (1UL << 0) > > While it isn't strictly necessary, maybe we should provide some > always-set bits like statx does? While they would always be set, it Yes, good idea. > might incentivise programs to write code that checks if the request_mask > bits are set after the ioctl(2) returns from the outset. Then again, > PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly > from the outset. > > > + > > +struct pidfd_info { > > + /* Let userspace request expensive stuff explictly, and let the kernel > > + * indicate whether it knows about it. */ > > I would prefer a slightly more informative comment (which mentions that > this will also be used for extensibility), something like: > > > /* > * This mask is similar to the request_mask in statx(2). > * > * Userspace indicates what extensions or expensive-to-calculate fields > * they want by setting the corresponding bits in request_mask. > * > * When filling the structure, the kernel will only set bits > * corresponding to the fields that were actually filled by the kernel. > * This also includes any future extensions that might be automatically > * filled. If the structure size is too small to contain a field > * (requested or not), to avoid confusion the request_mask will not > * contain a bit for that field. > * > * As such, userspace MUST verify that request_mask contains the > * corresponding flags after the ioctl(2) returns to ensure that it is > * using valid data. > */ I agree. This is a good comment.