On Tue, Apr 25, 2017 at 5:23 AM, Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> This patch allows multiple private procfs instances inside the
> same pid namespace. Many other areas in the kernel and filesystems
> have been updated to be able to support private instances, devpts is one
> major example. The aim here is lightweight sandboxes, and to allow that we
> have to modernize procfs internals.
>
> 1) The main aim of this work is to have on embedded systems one
> supervisor for apps. Right now we have some lightweight sandbox support,
> however if we create pid namespaces we have to manage all the
> processes inside too, where our goal is to be able to run a bunch of
> apps each one inside its own mount namespace without being able to
> notice each other. We only want to use mount namespaces, and we want
> procfs to behave more like a real mount point.
>
> 2) Linux Security Modules have multiple ptrace paths inside some
> subsystems, however inside procfs, the implementation does not guarantee
> that the ptrace() check which triggers the security_ptrace_check() hook
> will always run. We have the 'hidepid' mount option that can be used to
> force the ptrace_may_access() check inside has_pid_permissions() to run.
> The problem is that 'hidepid' is per pid namespace and not attached to
> the mount point, any remount or modification of 'hidepid' will propagate
> to all other procfs mounts.
>
> This also makes it hard to support the Yama LSM in desktop and user
> sessions. Yama ptrace scope, which restricts ptrace and some other
> syscalls to be allowed only on inferiors, can be updated to have a
> per-task context, where the context will be inherited during fork(),
> clone() and preserved across execve(). If we support multiple private
> procfs instances, then we may force the ptrace_may_access() check on
> /proc/<pids>/ to always run inside those new procfs instances.
> This will allow us to specify on user sessions whether we should
> populate procfs with pids that the user can ptrace or not.
>
> By using Yama ptrace scope, some restricted users will only be able to see
> inferiors inside /proc, they won't even be able to see their other
> processes. Some software like Chromium, Firefox's crash handler, Wine
> and others are already using Yama to restrict which processes can be
> ptraceable. This change will give the possibility to restrict
> /proc/<pids>/ but more importantly it will give desktop users a
> generic and usable way to specify which users should see all processes
> and which users can not.
>
> Side notes:
> * This covers the lack of seccomp where it is not able to parse
> arguments, it is easy to install a seccomp filter on direct syscalls
> that operate on pids, however /proc/<pid>/ is a Linux ABI using
> filesystem syscalls. With this change LSMs should be able to analyze
> open/read/write/close...
>
> 3) This will modernize procfs and align it with all other filesystems
> and subsystems that have been updated recently to be able to work in a
> flexible way. This is the same as devpts where each mount now is a distinct
> filesystem such that ptys and their indices allocated in one mount are
> independent from ptys and their indices in all other mounts.
>
> We have to align procfs and modernize it to have a per mount context
> where at least the mount options do not propagate to all other mounts,
> then maybe we can continue to implement new features. One example is to
> require CAP_SYS_ADMIN in the init user namespace on some /proc/* which are
> not pids and which are not virtualized by design, or CAP_NET_ADMIN
> inside userns on the net bits that are virtualized, etc.
> These mount options won't propagate to previous mounts, and the system
> will continue to be usable.
>
> This patch introduces the new 'limit_pids' mount option as it was also
> suggested by Andy Lutomirski [1].
> When this option is passed we
> automatically create a private procfs instance. This is not the default
> behaviour since we do not want to break userspace and we do not want to
> provide different device IDs by default, please see [1] for why.

I think that calling the option to make a separate instance
"limit_pids" is extremely counterintuitive.

My strong preference would be to make proc *always* make a separate
instance (unless it's a bind mount) and to make it work. If that
means fudging stat() output, so be it.

Failing that, let's come up with some coherent way to make this work.
"new_instance" or similar would do. Then make limit_pids cause an
error unless new_instance is also set.

--Andy
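
To make the fallback rule above concrete (limit_pids is an error unless
new_instance is also set), here is a rough userspace sketch of the check.
It is only an illustration, not code from the patch: parse_proc_opts() and
struct proc_opts are hypothetical names, and both options are assumed to
be bare flags in a comma-separated string.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical parsed option set; not a structure from the patch. */
struct proc_opts {
	bool new_instance;
	bool limit_pids;
};

/*
 * Parse a comma-separated option string such as "new_instance,limit_pids".
 * Assumption: both options are bare flags without values.
 * Returns 0 on success, or -EINVAL for an unknown option or for
 * limit_pids given without new_instance.
 */
static int parse_proc_opts(const char *data, struct proc_opts *out)
{
	assert(data != NULL && out != NULL);

	out->new_instance = false;
	out->limit_pids = false;

	while (*data != '\0') {
		/* Length of the current token, up to the next comma or the end. */
		size_t len = strcspn(data, ",");

		if (len == strlen("new_instance") &&
		    strncmp(data, "new_instance", len) == 0)
			out->new_instance = true;
		else if (len == strlen("limit_pids") &&
			 strncmp(data, "limit_pids", len) == 0)
			out->limit_pids = true;
		else if (len != 0)
			return -EINVAL;	/* unknown option */

		data += len;
		if (*data == ',')
			data++;		/* skip the separator */
	}

	/* The proposed rule: limit_pids only makes sense on a new instance. */
	if (out->limit_pids && !out->new_instance)
		return -EINVAL;

	return 0;
}
```

With this sketch, "new_instance,limit_pids" and "new_instance" are
accepted, while "limit_pids" on its own is rejected with -EINVAL.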