Alexey Gladkov <gladkov.alexey@xxxxxxxxx> writes:

> This patch allows to have multiple procfs instances inside the
> same pid namespace. The aim here is lightweight sandboxes, and to allow
> that we have to modernize procfs internals.
>
> 1) The main aim of this work is to have on embedded systems one
> supervisor for apps. Right now we have some lightweight sandbox support,
> however if we create pid namespaces we have to manage all the
> processes inside too, where our goal is to be able to run a bunch of
> apps each one inside its own mount namespace without being able to
> notice each other. We only want to use mount namespaces, and we want
> procfs to behave more like a real mount point.
>
> 2) Linux Security Modules have multiple ptrace paths inside some
> subsystems, however inside procfs, the implementation does not guarantee
> that the ptrace() check which triggers the security_ptrace_check() hook
> will always run. We have the 'hidepid' mount option that can be used to
> force the ptrace_may_access() check inside has_pid_permissions() to run.
> The problem is that 'hidepid' is per pid namespace and not attached to
> the mount point, any remount or modification of 'hidepid' will propagate
> to all other procfs mounts.
>
> This also does not allow to support Yama LSM easily in desktop and user
> sessions. Yama ptrace scope which restricts ptrace and some other
> syscalls to be allowed only on inferiors, can be updated to have a
> per-task context, where the context will be inherited during fork(),
> clone() and preserved across execve(). If we support multiple private
> procfs instances, then we may force the ptrace_may_access() on
> /proc/<pids>/ to always run inside those new procfs instances. This will
> allow to specify on user sessions if we should populate procfs with
> pids that the user can ptrace or not.
>
> By using Yama ptrace scope, some restricted users will only be able to see
> inferiors inside /proc, they won't even be able to see their other
> processes. Some software like Chromium, Firefox's crash handler, Wine
> and others are already using Yama to restrict which processes can be
> ptraceable. With this change this will give the possibility to restrict
> /proc/<pids>/ but more importantly this will give desktop users a
> generic and usable way to specify which users should see all processes
> and which users can not.
>
> Side notes:
> * This covers the lack of seccomp where it is not able to parse
> arguments, it is easy to install a seccomp filter on direct syscalls
> that operate on pids, however /proc/<pid>/ is a Linux ABI using
> filesystem syscalls. With this change LSMs should be able to analyze
> open/read/write/close...
>
> In the new patchset version I removed the 'newinstance' option
> as suggested by Eric W. Biederman.

Some very small requests.

1) Can you please not place fs_info in fs_context, and instead allocate
   fs_info in fill_super? Unless I have misread, that introduces a resource
   leak if proc is not mounted or if proc is simply reconfigured.

2) Can you please move hide_pid and pid_gid into fs_info in this patch?
   As was shown by my recent bug fix

3) Can you please rebase on v5.7-rc1 or v5.7-rc2 and repost these
   patches? I thought I could do it safely, but between my bug fixes
   and Alexey Dobriyan's parallel changes to proc these patches do not
   apply cleanly. Plus there is a resource leak in this patch.
Eric

> Signed-off-by: Alexey Gladkov <gladkov.alexey@xxxxxxxxx>
> Reviewed-by: Alexey Dobriyan <adobriyan@xxxxxxxxx>
> Reviewed-by: Kees Cook <keescook@xxxxxxxxxxxx>
> ---
>  fs/proc/base.c                | 13 +++++++----
>  fs/proc/inode.c               |  4 ++--
>  fs/proc/root.c                | 42 ++++++++++++++++++++++-------------
>  fs/proc/self.c                |  6 ++---
>  fs/proc/thread_self.c         |  6 ++---
>  include/linux/pid_namespace.h |  4 ----
>  include/linux/proc_fs.h       | 12 ++++++++++
>  7 files changed, 55 insertions(+), 32 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 74f948a6b621..3b9155a69ade 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3301,6 +3301,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
>  {
>  	struct task_struct *task;
>  	unsigned tgid;
> +	struct proc_fs_info *fs_info;
>  	struct pid_namespace *ns;
>  	struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3308,7 +3309,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
>  	if (tgid == ~0U)
>  		goto out;
>
> -	ns = dentry->d_sb->s_fs_info;
> +	fs_info = proc_sb_info(dentry->d_sb);
> +	ns = fs_info->pid_ns;
>  	rcu_read_lock();
>  	task = find_task_by_pid_ns(tgid, ns);
>  	if (task)
> @@ -3372,6 +3374,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
>  int proc_pid_readdir(struct file *file, struct dir_context *ctx)
>  {
>  	struct tgid_iter iter;
> +	struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
>  	struct pid_namespace *ns = proc_pid_ns(file_inode(file));
>  	loff_t pos = ctx->pos;
>
> @@ -3379,13 +3382,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
>  		return 0;
>
>  	if (pos == TGID_OFFSET - 2) {
> -		struct inode *inode = d_inode(ns->proc_self);
> +		struct inode *inode = d_inode(fs_info->proc_self);
>  		if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
>  			return 0;
>  		ctx->pos = pos = pos + 1;
>  	}
>  	if (pos == TGID_OFFSET - 1) {
> -		struct inode *inode = d_inode(ns->proc_thread_self);
> +		struct inode *inode = d_inode(fs_info->proc_thread_self);
>  		if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
>  			return 0;
>  		ctx->pos = pos = pos + 1;
> @@ -3599,6 +3602,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
>  	struct task_struct *task;
>  	struct task_struct *leader = get_proc_task(dir);
>  	unsigned tid;
> +	struct proc_fs_info *fs_info;
>  	struct pid_namespace *ns;
>  	struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3609,7 +3613,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
>  	if (tid == ~0U)
>  		goto out;
>
> -	ns = dentry->d_sb->s_fs_info;
> +	fs_info = proc_sb_info(dentry->d_sb);
> +	ns = fs_info->pid_ns;
>  	rcu_read_lock();
>  	task = find_task_by_pid_ns(tid, ns);
>  	if (task)
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 1e730ea1dcd6..6e4c6728338b 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -167,8 +167,8 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
>
>  static int proc_show_options(struct seq_file *seq, struct dentry *root)
>  {
> -	struct super_block *sb = root->d_sb;
> -	struct pid_namespace *pid = sb->s_fs_info;
> +	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
> +	struct pid_namespace *pid = fs_info->pid_ns;
>
>  	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
>  		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 2633f10446c3..b28adbb0b937 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -30,7 +30,7 @@
>  #include "internal.h"
>
>  struct proc_fs_context {
> -	struct pid_namespace *pid_ns;
> +	struct proc_fs_info *fs_info;
  	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please don't do this. As best as I can tell that introduces a memory
leak if proc is not mounted. Please allocate fs_info in

>  	unsigned int mask;
>  	int hidepid;
>  	int gid;
> @@ -92,7 +92,8 @@ static void proc_apply_options(struct super_block *s,
>
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  {
> -	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct pid_namespace *pid_ns = get_pid_ns(ctx->fs_info->pid_ns);
>  	struct inode *root_inode;
>  	int ret;
>
> @@ -106,6 +107,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  	s->s_magic = PROC_SUPER_MAGIC;
>  	s->s_op = &proc_sops;
>  	s->s_time_gran = 1;
> +	s->s_fs_info = ctx->fs_info;
>
>  	/*
>  	 * procfs isn't actually a stacking filesystem; however, there is
> @@ -113,7 +115,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  	 * top of it
>  	 */
>  	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> -
> +
>  	/* procfs dentries and inodes don't require IO to create */
>  	s->s_shrink.seeks = 0;
>
> @@ -140,7 +142,8 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  static int proc_reconfigure(struct fs_context *fc)
>  {
>  	struct super_block *sb = fc->root->d_sb;
> -	struct pid_namespace *pid = sb->s_fs_info;
> +	struct proc_fs_info *fs_info = proc_sb_info(sb);
> +	struct pid_namespace *pid = fs_info->pid_ns;
>
>  	sync_filesystem(sb);
>
> @@ -150,16 +153,14 @@ static int proc_reconfigure(struct fs_context *fc)
>
>  static int proc_get_tree(struct fs_context *fc)
>  {
> -	struct proc_fs_context *ctx = fc->fs_private;
> -
> -	return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
> +	return get_tree_nodev(fc, proc_fill_super);
>  }
>
>  static void proc_fs_context_free(struct fs_context *fc)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
>
> -	put_pid_ns(ctx->pid_ns);
> +	put_pid_ns(ctx->fs_info->pid_ns);
>  	kfree(ctx);
>  }
>
> @@ -178,9 +179,15 @@ static int proc_init_fs_context(struct fs_context *fc)
>  	if (!ctx)
>  		return -ENOMEM;
>
> -	ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
> +	ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
> +	if (!ctx->fs_info) {
> +		kfree(ctx);
> +		return -ENOMEM;
> +	}
> +
> +	ctx->fs_info->pid_ns = get_pid_ns(task_active_pid_ns(current));
>  	put_user_ns(fc->user_ns);
> -	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> +	fc->user_ns = get_user_ns(ctx->fs_info->pid_ns->user_ns);
>  	fc->fs_private = ctx;
>  	fc->ops = &proc_fs_context_ops;
>  	return 0;
> @@ -188,15 +195,18 @@ static int proc_init_fs_context(struct fs_context *fc)
>
>  static void proc_kill_sb(struct super_block *sb)
>  {
> -	struct pid_namespace *ns;
> +	struct proc_fs_info *fs_info = proc_sb_info(sb);
> +	struct pid_namespace *ns = fs_info->pid_ns;
> +
> +	if (fs_info->proc_self)
> +		dput(fs_info->proc_self);
> +
> +	if (fs_info->proc_thread_self)
> +		dput(fs_info->proc_thread_self);
>
> -	ns = (struct pid_namespace *)sb->s_fs_info;
> -	if (ns->proc_self)
> -		dput(ns->proc_self);
> -	if (ns->proc_thread_self)
> -		dput(ns->proc_thread_self);
>  	kill_anon_super(sb);
>  	put_pid_ns(ns);
> +	kfree(fs_info);
>  }
>
>  static struct file_system_type proc_fs_type = {
> diff --git a/fs/proc/self.c b/fs/proc/self.c
> index 57c0a1047250..309301ac0136 100644
> --- a/fs/proc/self.c
> +++ b/fs/proc/self.c
> @@ -36,10 +36,10 @@ static unsigned self_inum __ro_after_init;
>  int proc_setup_self(struct super_block *s)
>  {
>  	struct inode *root_inode = d_inode(s->s_root);
> -	struct pid_namespace *ns = proc_pid_ns(root_inode);
> +	struct proc_fs_info *fs_info = proc_sb_info(s);
>  	struct dentry *self;
>  	int ret = -ENOMEM;
> -
> +
>  	inode_lock(root_inode);
>  	self = d_alloc_name(s->s_root, "self");
>  	if (self) {
> @@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
>  	if (ret)
>  		pr_err("proc_fill_super: can't allocate /proc/self\n");
>  	else
> -		ns->proc_self = self;
> +		fs_info->proc_self = self;
>
>  	return ret;
>  }
> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index f61ae53533f5..2493cbbdfa6f 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
>  int proc_setup_thread_self(struct super_block *s)
>  {
>  	struct inode *root_inode = d_inode(s->s_root);
> -	struct pid_namespace *ns = proc_pid_ns(root_inode);
> +	struct proc_fs_info *fs_info = proc_sb_info(s);
>  	struct dentry *thread_self;
>  	int ret = -ENOMEM;
>
> @@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
>  	inode_unlock(root_inode);
>
>  	if (ret)
> -		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
> +		pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
>  	else
> -		ns->proc_thread_self = thread_self;
> +		fs_info->proc_thread_self = thread_self;
>
>  	return ret;
>  }
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 4956e362e55e..de4534d93cb6 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,10 +32,6 @@ struct pid_namespace {
>  	struct kmem_cache *pid_cachep;
>  	unsigned int level;
>  	struct pid_namespace *parent;
> -#ifdef CONFIG_PROC_FS
> -	struct dentry *proc_self;
> -	struct dentry *proc_thread_self;
> -#endif
>  #ifdef CONFIG_BSD_PROCESS_ACCT
>  	struct fs_pin *bacct;
>  #endif
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 40a7982b7285..5920a4ecd71b 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -27,6 +27,17 @@ struct proc_ops {
>  	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
>  };
>
> +struct proc_fs_info {
> +	struct pid_namespace *pid_ns;
> +	struct dentry *proc_self;        /* For /proc/self */
> +	struct dentry *proc_thread_self; /* For /proc/thread-self */
> +};
> +
> +static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
> +{
> +	return sb->s_fs_info;
> +}
> +
>  #ifdef CONFIG_PROC_FS
>
>  typedef int (*proc_write_t)(struct file *, char *, size_t);
> @@ -161,6 +172,7 @@ int open_related_ns(struct ns_common *ns,
>  /* get the associated pid namespace for a file in procfs */
>  static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
>  {
> +	return proc_sb_info(inode->i_sb)->pid_ns;
>  	return inode->i_sb->s_fs_info;
>  }
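
For readers following the fs_info ownership request above, here is a
small userspace model of it. It is only a sketch under stated
assumptions: every name in it (ns_model, ctx_model, sb_model, ctx_init,
sb_fill, sb_kill, ...) is hypothetical and merely stands in for the pid
namespace, fs_context and super_block handling; it is not the proc code
and not a proposed patch. The idea it tries to show is that the context
holds nothing but a namespace reference, the per-superblock info is
allocated only when a superblock is actually filled, and it is freed
when that superblock is killed, so a context that never results in a
mount has nothing to leak.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins for illustration only; not the kernel's types. */
struct ns_model      { int refs; };
struct fs_info_model { struct ns_model *pid_ns; };
struct ctx_model     { struct ns_model *pid_ns; };
struct sb_model      { struct fs_info_model *s_fs_info; };

static int live_fs_info; /* outstanding fs_info_model allocations */

static struct ns_model *ns_get(struct ns_model *ns) { ns->refs++; return ns; }
static void ns_put(struct ns_model *ns) { ns->refs--; }

/* Context setup takes only a namespace reference, no fs_info. */
static struct ctx_model *ctx_init(struct ns_model *ns)
{
	struct ctx_model *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->pid_ns = ns_get(ns);
	return ctx;
}

/* Context teardown drops exactly what ctx_init() took. */
static void ctx_free(struct ctx_model *ctx)
{
	ns_put(ctx->pid_ns);
	free(ctx);
}

/* fs_info is allocated here, so it exists only for a real superblock. */
static int sb_fill(struct sb_model *sb, const struct ctx_model *ctx)
{
	struct fs_info_model *info = calloc(1, sizeof(*info));

	if (!info)
		return -1;
	live_fs_info++;
	info->pid_ns = ns_get(ctx->pid_ns);
	sb->s_fs_info = info;
	return 0;
}

/* Superblock teardown releases the info and its namespace reference. */
static void sb_kill(struct sb_model *sb)
{
	struct fs_info_model *info = sb->s_fs_info;

	ns_put(info->pid_ns);
	free(info);
	live_fs_info--;
	sb->s_fs_info = NULL;
}

/* A context that is created and dropped without ever being mounted. */
int leaked_without_mount(void)
{
	struct ns_model ns = { .refs = 1 };
	struct ctx_model *ctx = ctx_init(&ns);

	if (!ctx)
		return -1;
	ctx_free(ctx);
	assert(ns.refs == 1);
	return live_fs_info;
}

/* A full cycle: context, fill, context free, then superblock kill. */
int leaked_after_mount_cycle(void)
{
	struct ns_model ns = { .refs = 1 };
	struct sb_model sb = { 0 };
	struct ctx_model *ctx = ctx_init(&ns);

	if (!ctx)
		return -1;
	if (sb_fill(&sb, ctx) != 0) {
		ctx_free(ctx);
		return -1;
	}
	ctx_free(ctx);
	sb_kill(&sb);
	assert(ns.refs == 1);
	return live_fs_info;
}
```

In this model both leaked_without_mount() and leaked_after_mount_cycle()
are expected to report zero outstanding allocations. If the allocation
were moved into ctx_init() without a matching free in ctx_free(), the
first scenario would leave one behind, which is the shape of the leak
described in the review comments.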