Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl

On Thu, Jun 18, 2020 at 10:45:43AM +0200, Christian Brauner wrote:
> Add an ioctl() to return the PID of the init process/child reaper of a pid
> namespace as seen in the caller's pid namespace.
> 
> LXCFS is a tiny fuse filesystem used to virtualize various aspects of
> procfs. It is used actively by a large number of users including ChromeOS
> and cloud providers. LXCFS is run on the host. The files and directories it
> creates can be bind-mounted by e.g. a container at startup and mounted over
> the various procfs files the container wishes to have virtualized. When
> e.g. a read request for uptime is received, LXCFS will receive the pid of
> the reader. In order to virtualize the corresponding read, LXCFS needs to
> know the pid of the init process of the reader's pid namespace. In order to
> do this, LXCFS first needs to fork() two helper processes. The first helper
> process calls setns() to enter the reader's pid namespace. The second helper process is
> needed to create a process that is a proper member of the pid namespace.
> The second helper process then creates a ucred message with ucred.pid set
> to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid
> field to the corresponding pid number in LXCFS's pid namespace. This way
> LXCFS can learn the init pid number of the reader's pid namespace and can
> go on to virtualize. Since these two fork()s are costly, LXCFS maintains an
> init pid cache that caches a given pid for a fixed amount of time. The
> cache is pruned during new read requests. However, even with the cache the
> hit of the two fork()s is significant when a very large number of
> containers are running. With this simple patch we add an ns ioctl that
> lets a caller retrieve the init pid nr of a pid namespace through its
> pid namespace fd. This _significantly_ improves our performance with a very
> simple change. A caller should do something like:
> - pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID);
> - verify init_pid is still valid (not necessarily both but recommended):
>   - opening a pidfd to get a stable reference
>   - opening /proc/<init_pid>/ns/pid and verifying that <pid_ns_fd>
>     and the pid namespace fd of <init_pid> refer to the same pid namespace
> 
> Note, it is possible for the init process of the pid namespace (identified
> via the child_reaper member in the relevant pid namespace) to die and get
> reaped right after the ioctl returned. If that happens there are two cases
> to consider:
> - if the init process was single threaded, all other processes in the pid
>   namespace will be zapped and any new process creation in there will fail.
>   A caller can detect this case since either the init pid is still around
>   but it is a zombie, or it already has exited and not been recycled, or it
>   has exited, been reaped, and also been recycled. The last case is the
>   most interesting one but a caller would then be able to detect that the
>   recycled process lives in a different pid namespace.
> - if the init process was multi-threaded, then the kernel will try to make
>   one of the threads in the same thread-group - if any are still alive -
>   the new child_reaper. In this case the caller can detect that the thread
>   which exited and used to be the child_reaper is no longer alive. If its
>   tid has been recycled in the same pid namespace a caller can detect this
>   by parsing /proc/<tid>/status, looking at the NSpid: field and if
>   there's an entry with pid nr 1 in the respective pid namespace it can be
>   sure that it hasn't been recycled.
> Both options can be combined with pidfd_open() to make sure that a stable
> reference is maintained.
> 
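
For what it's worth, the NSpid check described above could look roughly
like the sketch below. It is not part of the patch. The helper name
nspid_innermost() is made up for illustration, and it assumes the caller
has already read the text of /proc/<tid>/status (where, as far as I know,
the NSpid: line lives) into a NUL-terminated buffer. It returns the last
entry on that line, i.e. the pid as seen in the task's innermost pid
namespace.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch): return the innermost (last)
 * entry on the "NSpid:" line of a /proc/<tid>/status text buffer, or -1
 * if the line is missing or carries no numbers.
 */
pid_t nspid_innermost(const char *status)
{
	const char *key = "NSpid:";
	const char *line = status;
	long last = -1;

	/* Walk line by line until one starts with "NSpid:". */
	while (line && strncmp(line, key, strlen(key)) != 0) {
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	if (!line)
		return -1;

	line += strlen(key);
	for (;;) {
		char *end;

		/* Skip separators by hand so we never cross the newline. */
		while (*line == ' ' || *line == '\t')
			line++;
		if (*line < '0' || *line > '9')
			break;
		last = strtol(line, &end, 10);
		line = end;
	}
	return (pid_t)last;
}
```

A result of 1 would suggest the task is still the init process of its own
pid namespace. Anything else (including -1) should be treated as "stale,
query again".
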
> Cc: Wolfgang Bumiller <w.bumiller@xxxxxxxxxxx>
> Cc: Serge Hallyn <serge@xxxxxxxxxx>
> Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>

fs/nsfs.c: In function ‘ns_ioctl’:
fs/nsfs.c:195:14: warning: unused variable ‘pid_struct’ [-Wunused-variable]
  struct pid *pid_struct;
              ^~~~~~~~~~
fs/nsfs.c:194:22: warning: unused variable ‘child_reaper’ [-Wunused-variable]
  struct task_struct *child_reaper;
                      ^~~~~~~~~~~~

> ---
>  fs/nsfs.c                 | 29 +++++++++++++++++++++++++++++
>  include/uapi/linux/nsfs.h |  2 ++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index 800c1d0eb0d0..5a7de1ee6df0 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -8,6 +8,7 @@
>  #include <linux/magic.h>
>  #include <linux/ktime.h>
>  #include <linux/seq_file.h>
> +#include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
>  #include <linux/nsfs.h>
>  #include <linux/uaccess.h>
> @@ -189,6 +190,10 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
>  			unsigned long arg)
>  {
>  	struct user_namespace *user_ns;
> +	struct pid_namespace *pid_ns;
> +	struct task_struct *child_reaper;
> +	struct pid *pid_struct;
> +	pid_t pid;
>  	struct ns_common *ns = get_proc_ns(file_inode(filp));
>  	uid_t __user *argp;
>  	uid_t uid;
> @@ -209,6 +214,30 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl,
>  		argp = (uid_t __user *) arg;
>  		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
>  		return put_user(uid, argp);
> +	case NS_GET_INIT_PID:
> +		if (ns->ops->type != CLONE_NEWPID)
> +			return -EINVAL;
> +
> +		pid_ns = container_of(ns, struct pid_namespace, ns);
> +
> +		/*
> +		 * If we're asking for the init pid of our own pid namespace
> +		 * that's of course silly but no need to fail this since we can
> +		 * either infer or find out our own pid namespace's init pid
> +		 * trivially. In all other cases, we require the same
> +		 * privileges as for setns().
> +		 */
> +		if (task_active_pid_ns(current) != pid_ns &&
> +		    !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		pid = -ESRCH;
> +		read_lock(&tasklist_lock);
> +		if (likely(pid_ns->child_reaper))
> +			pid = task_pid_vnr(pid_ns->child_reaper);
> +		read_unlock(&tasklist_lock);
> +
> +		return pid;
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h
> index a0c8552b64ee..29c775f42bbe 100644
> --- a/include/uapi/linux/nsfs.h
> +++ b/include/uapi/linux/nsfs.h
> @@ -15,5 +15,7 @@
>  #define NS_GET_NSTYPE		_IO(NSIO, 0x3)
>  /* Get owner UID (in the caller's user namespace) for a user namespace */
>  #define NS_GET_OWNER_UID	_IO(NSIO, 0x4)
> +/* Get init PID (in the caller's pid namespace) of a pid namespace */
> +#define NS_GET_INIT_PID		_IO(NSIO, 0x5)
>  
>  #endif /* __LINUX_NSFS_H */
> 
> base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407
> -- 
> 2.27.0
> 
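
As an illustration of the caller-side sequence in the commit message
(ioctl, then verify that /proc/<init_pid>/ns/pid and pid_ns_fd refer to
the same pid namespace), here is a minimal userspace sketch. It is only a
sketch under assumptions: the helper names are hypothetical, the
NS_GET_INIT_PID value is copied from this patch (with NSIO being 0xb7) and
only means something on a kernel carrying the patch, and "same namespace"
is decided by comparing st_dev and st_ino of the two nsfs fds.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef NS_GET_INIT_PID
/* Value proposed by this patch; assumption: NSIO is 0xb7. */
#define NS_GET_INIT_PID _IO(0xb7, 0x5)
#endif

/* Ask for the init pid of the pid namespace behind pid_ns_fd; -1 on error. */
pid_t get_init_pid(int pid_ns_fd)
{
	return ioctl(pid_ns_fd, NS_GET_INIT_PID);
}

/* True if both fds refer to the same object (same device and inode). */
bool same_ns(int fd_a, int fd_b)
{
	struct stat a, b;

	if (fstat(fd_a, &a) < 0 || fstat(fd_b, &b) < 0)
		return false;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/* Check that init_pid currently lives in the namespace behind pid_ns_fd. */
bool init_pid_in_ns(int pid_ns_fd, pid_t init_pid)
{
	char path[64];
	bool ok;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)init_pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return false;	/* already gone, or not visible to us */
	ok = same_ns(pid_ns_fd, fd);
	close(fd);
	return ok;
}
```

A caller would invoke get_init_pid(pid_ns_fd), take a pidfd on the result
if it wants a stable reference, and then use init_pid_in_ns() to reject a
pid that was recycled into a different pid namespace in the meantime.
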


