On Thu, Jun 18, 2020 at 10:45:43AM +0200, Christian Brauner wrote: > Add an ioctl() to return the PID of the init process/child reaper of a pid > namespace as seen in the caller's pid namespace. > > LXCFS is a tiny fuse filesystem used to virtualize various aspects of > procfs. It is used actively by a large number of users including ChromeOS > and cloud providers. LXCFS is run on the host. The files and directories it > creates can be bind-mounted by e.g. a container at startup and mounted over > the various procfs files the container wishes to have virtualized. When > e.g. a read request for uptime is received, LXCFS will receive the pid of > the reader. In order to virtualize the corresponding read, LXCFS needs to > know the pid of the init process of the reader's pid namespace. In order to > do this, LXCFS first needs to fork() two helper processes. The first helper > process setns() to the readers pid namespace. The second helper process is > needed to create a process that is a proper member of the pid namespace. > The second helper process then creates a ucred message with ucred.pid set > to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid > field to the corresponding pid number in LXCFS's pid namespace. This way > LXCFS can learn the init pid number of the reader's pid namespace and can > go on to virtualize. Since these two forks() are costly LXCFS maintains an > init pid cache that caches a given pid for a fixed amount of time. The > cache is pruned during new read requests. However, even with the cache the > hit of the two forks() is singificant when a very large number of > containers are running. With this simple patch we add an ns ioctl that > let's a caller retrieve the init pid nr of a pid namespace through its > pid namespace fd. This _significantly_ improves our performance with a very > simple change. A caller should do something like: > - pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID); > - verify init_pid is still valid (not necessarily both but recommended): > - opening a pidfd to get a stable reference > - opening /proc/<init_pid>/ns/pid and verifying that <pid_ns_fd> > and the pid namespace fd of <init_pid> refer to the same pid namespace > > Note, it is possible for the init process of the pid namespace (identified > via the child_reaper member in the relevant pid namespace) to die and get > reaped right after the ioctl returned. If that happens there are two cases > to consider: > - if the init process was single threaded, all other processes in the pid > namespace will be zapped and any new process creation in there will fail; > A caller can detect this case since either the init pid is still around > but it is a zombie, or it already has exited and not been recycled, or it > has exited, been reaped, and also been recycled. The last case is the > most interesting one but a caller would then be able to detect that the > recycled process lives in a different pid namespace. > - if the init process was multi-threaded, then the kernel will try to make > one of the threads in the same thread-group - if any are still alive - > the new child_reaper. In this case the caller can detect that the thread > which exited and used to be the child_reaper is no longer alive. If it's > tid has been recycled in the same pid namespace a caller can detect this > by parsing through /proc/<tid>/stat, looking at the Nspid: field and if > there's a entry with pid nr 1 in the respective pid namespace it can be > sure that it hasn't been recycled. > Both options can be combined with pidfd_open() to make sure that a stable > reference is maintained. > > Cc: Wolfgang Bumiller <w.bumiller@xxxxxxxxxxx> > Cc: Serge Hallyn <serge@xxxxxxxxxx> > Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx> > Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: linux-fsdevel@xxxxxxxxxxxxxxx > Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx> fs/nsfs.c: In function ‘ns_ioctl’: fs/nsfs.c:195:14: warning: unused variable ‘pid_struct’ [-Wunused-variable] struct pid *pid_struct; ^~~~~~~~~~ fs/nsfs.c:194:22: warning: unused variable ‘child_reaper’ [-Wunused-variable] struct task_struct *child_reaper; ^~~~~~~~~~~~ > --- > fs/nsfs.c | 29 +++++++++++++++++++++++++++++ > include/uapi/linux/nsfs.h | 2 ++ > 2 files changed, 31 insertions(+) > > diff --git a/fs/nsfs.c b/fs/nsfs.c > index 800c1d0eb0d0..5a7de1ee6df0 100644 > --- a/fs/nsfs.c > +++ b/fs/nsfs.c > @@ -8,6 +8,7 @@ > #include <linux/magic.h> > #include <linux/ktime.h> > #include <linux/seq_file.h> > +#include <linux/pid_namespace.h> > #include <linux/user_namespace.h> > #include <linux/nsfs.h> > #include <linux/uaccess.h> > @@ -189,6 +190,10 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, > unsigned long arg) > { > struct user_namespace *user_ns; > + struct pid_namespace *pid_ns; > + struct task_struct *child_reaper; > + struct pid *pid_struct; > + pid_t pid; > struct ns_common *ns = get_proc_ns(file_inode(filp)); > uid_t __user *argp; > uid_t uid; > @@ -209,6 +214,30 @@ static long ns_ioctl(struct file *filp, unsigned int ioctl, > argp = (uid_t __user *) arg; > uid = from_kuid_munged(current_user_ns(), user_ns->owner); > return put_user(uid, argp); > + case NS_GET_INIT_PID: > + if (ns->ops->type != CLONE_NEWPID) > + return -EINVAL; > + > + pid_ns = container_of(ns, struct pid_namespace, ns); > + > + /* > + * If we're asking for the init pid of our own pid namespace > + * that's of course silly but no need to fail this since we can > + * both infer or find out our own pid namespaces's init pid > + * trivially. In all other cases, we require the same > + * privileges as for setns(). > + */ > + if (task_active_pid_ns(current) != pid_ns && > + !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN)) > + return -EPERM; > + > + pid = -ESRCH; > + read_lock(&tasklist_lock); > + if (likely(pid_ns->child_reaper)) > + pid = task_pid_vnr(pid_ns->child_reaper); > + read_unlock(&tasklist_lock); > + > + return pid; > default: > return -ENOTTY; > } > diff --git a/include/uapi/linux/nsfs.h b/include/uapi/linux/nsfs.h > index a0c8552b64ee..29c775f42bbe 100644 > --- a/include/uapi/linux/nsfs.h > +++ b/include/uapi/linux/nsfs.h > @@ -15,5 +15,7 @@ > #define NS_GET_NSTYPE _IO(NSIO, 0x3) > /* Get owner UID (in the caller's user namespace) for a user namespace */ > #define NS_GET_OWNER_UID _IO(NSIO, 0x4) > +/* Get init PID (in the caller's pid namespace) of a pid namespace */ > +#define NS_GET_INIT_PID _IO(NSIO, 0x5) > > #endif /* __LINUX_NSFS_H */ > > base-commit: b3a9e3b9622ae10064826dccb4f7a52bd88c7407 > -- > 2.27.0 >