Re: [PATCH v9] pidfd: add ioctl to retrieve pid info

Christian Brauner <brauner@xxxxxxxxxx> · Thu, 10 Oct 2024 11:46:31 +0200

On Thu, Oct 10, 2024 at 07:50:36AM +1100, Aleksa Sarai wrote:
> On 2024-10-08, luca.boccassi@xxxxxxxxx <luca.boccassi@xxxxxxxxx> wrote:
> > From: Luca Boccassi <luca.boccassi@xxxxxxxxx>
> > 
> > A common pattern when using pid fds is having to get information
> > about the process, which currently requires /proc being mounted,
> > resolving the fd to a pid, and then do manual string parsing of
> > /proc/N/status and friends. This needs to be reimplemented over
> > and over in all userspace projects (e.g.: I have reimplemented
> > resolving in systemd, dbus, dbus-daemon, polkit so far), and
> > requires additional care in checking that the fd is still valid
> > after having parsed the data, to avoid races.
> > 
> > Having a programmatic API that can be used directly removes all
> > these requirements, including having /proc mounted.
> > 
> > As discussed at LPC24, add an ioctl with an extensible struct
> > so that more parameters can be added later if needed. Start with
> > returning pid/tgid/ppid and creds unconditionally, and cgroupid
> > optionally.
> > 
> > Signed-off-by: Luca Boccassi <luca.boccassi@xxxxxxxxx>
> > ---
> > v9: drop result_mask and reuse request_mask instead
> > v8: use RAII guard for rcu, call put_cred()
> > v7: fix RCU issue and style issue introduced by v6 found by reviewer
> > v6: use rcu_read_lock() when fetching cgroupid, use task_ppid_nr_ns() to
> >     get the ppid, return ESCHR if any of pid/tgid/ppid are 0 at the end
> >     of the call to avoid providing incomplete data, document what the
> >     callers should expect
> > v5: check again that the task hasn't exited immediately before copying
> >     the result out to userspace, to ensure we are not returning stale data
> >     add an ifdef around the cgroup structs usage to fix build errors when
> >     the feature is disabled
> > v4: fix arg check in pidfd_ioctl() by moving it after the new call
> > v3: switch from pid_vnr() to task_pid_vnr()
> > v2: Apply comments from Christian, apart from the one about pid namespaces
> >     as I need additional hints on how to implement it.
> >     Drop the security_context string as it is not the appropriate
> >     metadata to give userspace these days.
> > 
> >  fs/pidfs.c                                    | 88 ++++++++++++++++++-
> >  include/uapi/linux/pidfd.h                    | 30 +++++++
> >  .../testing/selftests/pidfd/pidfd_open_test.c | 80 ++++++++++++++++-
> >  3 files changed, 194 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/pidfs.c b/fs/pidfs.c
> > index 80675b6bf884..15cdc7fe4968 100644
> > --- a/fs/pidfs.c
> > +++ b/fs/pidfs.c
> > @@ -2,6 +2,7 @@
> >  #include <linux/anon_inodes.h>
> >  #include <linux/file.h>
> >  #include <linux/fs.h>
> > +#include <linux/cgroup.h>
> >  #include <linux/magic.h>
> >  #include <linux/mount.h>
> >  #include <linux/pid.h>
> > @@ -114,6 +115,83 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
> >  	return poll_flags;
> >  }
> >  
> > +static long pidfd_info(struct task_struct *task, unsigned int cmd, unsigned long arg)
> > +{
> > +	struct pidfd_info __user *uinfo = (struct pidfd_info __user *)arg;
> > +	size_t usize = _IOC_SIZE(cmd);
> > +	struct pidfd_info kinfo = {};
> > +	struct user_namespace *user_ns;
> > +	const struct cred *c;
> > +	__u64 request_mask;
> > +
> > +	if (!uinfo)
> > +		return -EINVAL;
> > +	if (usize < sizeof(struct pidfd_info))
> > +		return -EINVAL; /* First version, no smaller struct possible */
> > +
> > +	if (copy_from_user(&request_mask, &uinfo->request_mask, sizeof(request_mask)))
> > +		return -EFAULT;
> > +
> > +	c = get_task_cred(task);
> > +	if (!c)
> > +		return -ESRCH;
> > +
> > +	/* Unconditionally return identifiers and credentials, the rest only on request */
> > +
> > +	user_ns = current_user_ns();
> > +	kinfo.ruid = from_kuid_munged(user_ns, c->uid);
> > +	kinfo.rgid = from_kgid_munged(user_ns, c->gid);
> > +	kinfo.euid = from_kuid_munged(user_ns, c->euid);
> > +	kinfo.egid = from_kgid_munged(user_ns, c->egid);
> > +	kinfo.suid = from_kuid_munged(user_ns, c->suid);
> > +	kinfo.sgid = from_kgid_munged(user_ns, c->sgid);
> > +	kinfo.fsuid = from_kuid_munged(user_ns, c->fsuid);
> > +	kinfo.fsgid = from_kgid_munged(user_ns, c->fsgid);
> > +	put_cred(c);
> > +
> > +#ifdef CONFIG_CGROUPS
> > +	if (request_mask & PIDFD_INFO_CGROUPID) {
> > +		struct cgroup *cgrp;
> > +
> > +		guard(rcu)();
> > +		cgrp = task_cgroup(task, pids_cgrp_id);
> > +		if (!cgrp)
> > +			return -ENODEV;
> > +		kinfo.cgroupid = cgroup_id(cgrp);
> > +
> > +		kinfo.request_mask |= PIDFD_INFO_CGROUPID;
> > +	}
> > +#endif
> > +
> > +	/*
> > +	 * Copy pid/tgid last, to reduce the chances the information might be
> > +	 * stale. Note that it is not possible to ensure it will be valid as the
> > +	 * task might return as soon as the copy_to_user finishes, but that's ok
> > +	 * and userspace expects that might happen and can act accordingly, so
> > +	 * this is just best-effort. What we can do however is checking that all
> > +	 * the fields are set correctly, or return ESRCH to avoid providing
> > +	 * incomplete information. */
> > +
> > +	kinfo.ppid = task_ppid_nr_ns(task, NULL);
> > +	kinfo.tgid = task_tgid_vnr(task);
> > +	kinfo.pid = task_pid_vnr(task);
> > +
> > +	if (kinfo.pid == 0 || kinfo.tgid == 0 || (kinfo.ppid == 0 && kinfo.pid != 1))
> > +		return -ESRCH;
> > +
> > +	/*
> > +	 * If userspace and the kernel have the same struct size it can just
> > +	 * be copied. If userspace provides an older struct, only the bits that
> > +	 * userspace knows about will be copied. If userspace provides a new
> > +	 * struct, only the bits that the kernel knows about will be copied and
> > +	 * the size value will be set to the size the kernel knows about.
> > +	 */
> > +	if (copy_to_user(uinfo, &kinfo, min(usize, sizeof(kinfo))))
> > +		return -EFAULT;
> 
> If usize > ksize, we also want to clear_user() the trailing bytes to
> avoid userspace thinking that any garbage bytes they had are valid.
> 
> Also, you mention "the size value" but there is no size in pidfd_info. I
> don't think it's actually necessary to include such a field (especially
> when you have a statx-like request_mask), but it means you really should
> clear the trailing bytes to avoid userspace bugs.
> 
> I implemented all of these semantics as copy_struct_to_user() in the
> CHECK_FIELDS patch I sent a few weeks ago (I just sent v3[1]). Maybe you
> can cherry-pick this patch and use it? The semantics when we extend this
> pidfd_info to accept new request_mask values with larger structures is
> going to get a little ugly and copy_struct_to_user() makes this a little
> easier to deal with.
> 
> [1]: https://lore.kernel.org/all/20241010-extensible-structs-check_fields-v3-1-d2833dfe6edd@xxxxxxxxxx/

I agree. @Luca, you can either send the two patches together or I can
just port the patch to it. I don't mind.

> 
> > +
> > +	return 0;
> > +}
> > +
> >  static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  {
> >  	struct task_struct *task __free(put_task) = NULL;
> > @@ -122,13 +200,17 @@ static long pidfd_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> >  	struct ns_common *ns_common = NULL;
> >  	struct pid_namespace *pid_ns;
> >  
> > -	if (arg)
> > -		return -EINVAL;
> > -
> >  	task = get_pid_task(pid, PIDTYPE_PID);
> >  	if (!task)
> >  		return -ESRCH;
> >  
> > +	/* Extensible IOCTL that does not open namespace FDs, take a shortcut */
> > +	if (_IOC_NR(cmd) == _IOC_NR(PIDFD_GET_INFO))
> > +		return pidfd_info(task, cmd, arg);
> > +
> > +	if (arg)
> > +		return -EINVAL;
> > +
> >  	scoped_guard(task_lock, task) {
> >  		nsp = task->nsproxy;
> >  		if (nsp)
> > diff --git a/include/uapi/linux/pidfd.h b/include/uapi/linux/pidfd.h
> > index 565fc0629fff..d685eeeedc51 100644
> > --- a/include/uapi/linux/pidfd.h
> > +++ b/include/uapi/linux/pidfd.h
> > @@ -16,6 +16,35 @@
> >  #define PIDFD_SIGNAL_THREAD_GROUP	(1UL << 1)
> >  #define PIDFD_SIGNAL_PROCESS_GROUP	(1UL << 2)
> >  
> > +/* Flags for pidfd_info. */
> > +#define PIDFD_INFO_CGROUPID		(1UL << 0)
> 
> While it isn't strictly necessary, maybe we should provide some
> always-set bits like statx does? While they would always be set, it

Yes, good idea.

> might incentivise programs to write code that checks if the request_mask
> bits are set after the ioctl(2) returns from the outset. Then again,
> PIDFD_INFO_CGROUPID is probably enough to justify writing code correctly
> from the outset.
> 
> > +
> > +struct pidfd_info {
> > +	/* Let userspace request expensive stuff explictly, and let the kernel
> > +	 * indicate whether it knows about it. */
> 
> I would prefer a slightly more informative comment (which mentions that
> this will also be used for extensibility), something like:
> 
> 
> /*
>  * This mask is similar to the request_mask in statx(2).
>  *
>  * Userspace indicates what extensions or expensive-to-calculate fields
>  * they want by setting the corresponding bits in request_mask.
>  *
>  * When filling the structure, the kernel will only set bits
>  * corresponding to the fields that were actually filled by the kernel.
>  * This also includes any future extensions that might be automatically
>  * filled. If the structure size is too small to contain a field
>  * (requested or not), to avoid confusion the request_mask will not
>  * contain a bit for that field.
>  *
>  * As such, userspace MUST verify that request_mask contains the
>  * corresponding flags after the ioctl(2) returns to ensure that it is
>  * using valid data.
>  */

I agree. This is a good comment.