Re: [PATCH RFC v2 4/6] proc: support mounting private procfs instances inside same pid namespace

Andy Lutomirski <luto@xxxxxxxxxx> · Wed, 26 Apr 2017 15:13:30 -0700

On Tue, Apr 25, 2017 at 5:23 AM, Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
> This patch allows multiple private procfs instances inside the
> same pid namespace. A lot of other areas in the kernel and filesystems
> have been updated to be able to support private instances, devpts is one
> major example. The aim here is lightweight sandboxes, and to allow that we
> have to modernize procfs internals.
>
> 1) The main aim of this work is to have on embedded systems one
> supervisor for apps. Right now we have some lightweight sandbox support,
> however if we create pid namespaces we have to manage all the
> processes inside too, where our goal is to be able to run a bunch of
> apps each one inside its own mount namespace without being able to
> notice each other. We only want to use mount namespaces, and we want
> procfs to behave more like a real mount point.
>
> 2) Linux Security Modules have multiple ptrace paths inside some
> subsystems, however inside procfs, the implementation does not guarantee
> that the ptrace() check which triggers the security_ptrace_check() hook
> will always run. We have the 'hidepid' mount option that can be used to
> force the ptrace_may_access() check inside has_pid_permissions() to run.
> The problem is that 'hidepid' is per pid namespace and not attached to
> the mount point, any remount or modification of 'hidepid' will propagate
> to all other procfs mounts.
>
> This also does not allow to support Yama LSM easily in desktop and user
> sessions. Yama ptrace scope which restricts ptrace and some other
> syscalls to be allowed only on inferiors, can be updated to have a
> per-task context, where the context will be inherited during fork(),
> clone() and preserved across execve(). If we support multiple private
> procfs instances, then we may force the ptrace_may_access() check on
> /proc/<pids>/ to always run inside the new procfs instances. This will
> allow us to specify on user sessions if we should populate procfs with
> pids that the user can ptrace or not.
>
> By using Yama ptrace scope, some restricted users will only be able to see
> inferiors inside /proc, they won't even be able to see their other
> processes. Some software like Chromium, Firefox's crash handler, Wine
> and others are already using Yama to restrict which processes can be
> ptraceable. This change will make it possible to restrict
> /proc/<pids>/ but more importantly it will give desktop users a
> generic and usable way to specify which users should see all processes
> and which users cannot.
>
> Side notes:
> * This covers the lack of seccomp where it is not able to parse
> arguments, it is easy to install a seccomp filter on direct syscalls
> that operate on pids, however /proc/<pid>/ is a Linux ABI using
> filesystem syscalls. With this change LSMs should be able to analyze
> open/read/write/close...
>
> 3) This will modernize procfs and align it with all other filesystems
> and subsystems that have been updated recently to be able to work in a
> flexible way. This is the same as devpts where each mount now is a distinct
> filesystem such that ptys and their indices allocated in one mount are
> independent from ptys and their indices in all other mounts.
>
> We have to align procfs and modernize it to have a per mount context
> where at least the mount options do not propagate to all other mounts,
> then maybe we can continue to implement new features. One example is to
> require CAP_SYS_ADMIN in the init user namespace on some /proc/* which are
> not pids and which are not virtualized by design, or CAP_NET_ADMIN
> inside userns on the net bits that are virtualized, etc.
> These mount options won't propagate to previous mounts, and the system
> will continue to be usable.
>
> This patch introduces the new 'limit_pids' mount option as it was also
> suggested by Andy Lutomirski [1]. When this option is passed we
> automatically create a private procfs instance. This is not the default
> behaviour since we do not want to break userspace and we do not want to
> provide different device IDs by default, please see [1] for why.

I think that calling the option to make a separate instance
"limit_pids" is extremely counterintuitive.

My strong preference would be to make proc *always* make a separate
instance (unless it's a bind mount) and to make it work.  If that
means fudging stat() output, so be it.

Failing that, let's come up with some coherent way to make this work.
"new_instance" or similar would do.  Then make limit_pids cause an
error unless new_instance is also set.
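
To make the fallback concrete, here is a rough userspace sketch of the
check I have in mind. It is only a model of the rule, not kernel code:
check_proc_opts() is a made-up name, "new_instance" is just the
placeholder option name from above, and I am assuming limit_pids may
appear either bare or as "limit_pids=<value>".

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical model of the proposed rule: "limit_pids" is rejected
 * unless "new_instance" appears in the same comma-separated option
 * string. All other options are ignored by this sketch.
 */
static int check_proc_opts(const char *opts)
{
	bool new_instance = false, limit_pids = false;
	char buf[256];
	char *tok, *save = NULL;

	if (strlen(opts) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, opts);

	for (tok = strtok_r(buf, ",", &save); tok;
	     tok = strtok_r(NULL, ",", &save)) {
		if (strcmp(tok, "new_instance") == 0)
			new_instance = true;
		else if (strcmp(tok, "limit_pids") == 0 ||
			 strncmp(tok, "limit_pids=", 11) == 0)
			limit_pids = true;
	}

	/* limit_pids only makes sense on a private instance. */
	return (limit_pids && !new_instance) ? -EINVAL : 0;
}
```

With this rule, "limit_pids=1" alone fails with -EINVAL, while
"new_instance,limit_pids=1" is accepted, so nobody gets a private
instance as a side effect of an unrelated-sounding option.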

--Andy