On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >
> > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote:
> > > > From the discussion so far it seems that there is a consensus that
> > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > anything that is pending on that patch?
> > >
> > > That's my reading of this thread too.
> > >
> > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > would be more a question to the Android userland team.
> > > > > >
> > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > already applied to unprivileged software running without
> > > > > > SYS_CAP_PTRACE capability?
> > > > >
> > > > > Yes.
> > > > >
> > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > seccomp on a few processes, but we have found that it has a
> > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > on system wide is not a good option.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > >
> > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > not applied to all processes. It's too expensive and impractical when
> > > simpler implementations (such as this sysctl) can exist. It's also
> > > significantly simpler to test a sysctl value for correctness as
> > > opposed to a seccomp filter.
> >
> > Given that selinux is already used system-wide on Android, what is wrong
> > with using selinux to control userfaultfd as opposed to seccomp?
>
> Userfaultfd file descriptors will be generally controlled by SELinux.
> You can see the patchset at
> https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@xxxxxxxxxx/
> (which is also referenced in the original commit message for this
> patchset). However, the SELinux patchset doesn't include the ability
> to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
>
> SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> which combined with this patch, is largely equivalent to direct
> UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> above, movement of userfaultfd file descriptors can be mediated by
> SELinux, preventing one process from acquiring userfaultfd descriptors
> of other processes unless allowed by security policy.
>
> It's an interesting question whether finer-grain SELinux support for
> controlling UFFD_USER_MODE_ONLY should be added. I can see some
> advantages to implementing this. However, we don't need to decide that
> now.
>
> Kernel security checks generally break down into DAC (discretionary
> access control) and MAC (mandatory access control) controls. Most
> kernel security features check via both of these mechanisms. Security
> attributes of the system should be settable without necessarily
> relying on an LSM such as SELinux. This patch follows the same basic
> model -- system wide control of a hardening feature is provided by the
> unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> SELinux support for this can also be implemented on top of the DAC
> controls.
>
> This DAC/MAC split has been successful in several other security
> features. For example, the ability to map at page zero is controlled
> in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> mmap_zero access vector [2]. Similarly, access to the kernel ring
> buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> well as the SELinux syslog_read [2] check.
> Indeed, the dmesg_restrict
> sysctl is very similar to this patch -- it introduces a capability
> (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
>
> If we want to ensure that a security feature will be well tested and
> vetted, it's important to not limit its use to LSMs only. This ensures
> that kernel and application developers will always be able to test the
> effects of a security feature, without relying on LSMs like SELinux.
> It also ensures that all distributions can enable this security
> mitigation should it be necessary for their unique environments,
> without introducing an SELinux dependency. And this patch does not
> preclude an SELinux implementation should it be necessary.
>
> Even if we decide to implement fine-grain SELinux controls on
> UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> an either/or choice between SELinux and this patch. Both are
> necessary.
>
> -- Nick
>
> [1] https://wiki.debian.org/mmap_min_addr
> [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree this is similar to dmesg access.
The reason I say it is this: it is pretty easy for admins
to know whether they run something that needs to access the kernel ring
buffer. Or if it's a tool developer poking at dmesg, they can tell admins
"we need these permissions". But it seems impossible for either an admin
or a tool developer to know that a userfaultfd page, e.g. one used with
shared memory, is accessed from the kernel.

So I guess the question is: how does anyone not running Android
know to set this flag?

I get the feeling it's not really possible, and so for a single-user
feature like this a single API seems enough. Given a choice between
a knob an admin is supposed to set and selinux policy written by
presumably knowledgeable OS vendors, I'd opt for the second option.

Hope this helps.
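
For reference while reading the rest of the thread, here is a minimal
sketch of the check being debated, written as plain Python rather than
kernel code. It assumes, based only on the description quoted above,
that with the knob enabled a caller lacking CAP_SYS_PTRACE may create a
userfaultfd only when it passes UFFD_USER_MODE_ONLY. The function name,
the parameter names and the flag's numeric value are illustrative
assumptions and are not taken from the patch.

```python
# Illustrative model of the debated policy; NOT kernel code.
# Assumption: the flag occupies the lowest bit. The value is for illustration only.
UFFD_USER_MODE_ONLY = 1


def may_create_userfaultfd(flags: int,
                           has_cap_sys_ptrace: bool,
                           user_mode_only_knob: bool) -> bool:
    """Return True if the modeled policy would allow userfaultfd creation.

    flags               -- flags the caller passes to userfaultfd()
    has_cap_sys_ptrace  -- whether the caller holds CAP_SYS_PTRACE
    user_mode_only_knob -- hypothetical stand-in for the sysctl in patch 2/2
    """
    if not user_mode_only_knob:
        # Knob off: the restriction discussed in this thread does not apply.
        return True
    if has_cap_sys_ptrace:
        # Privileged callers may still handle kernel-mode faults.
        return True
    # Unprivileged callers must restrict themselves to user-mode faults.
    return bool(flags & UFFD_USER_MODE_ONLY)


if __name__ == "__main__":
    for flags in (0, UFFD_USER_MODE_ONLY):
        for cap in (False, True):
            allowed = may_create_userfaultfd(flags, cap, True)
            print(f"flags={flags} cap_sys_ptrace={cap} allowed={allowed}")
```

Under these assumptions, the only combination refused with the knob set
is "no CAP_SYS_PTRACE and no UFFD_USER_MODE_ONLY", which is the sense in
which the quoted mail compares the sysctl to dmesg_restrict.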

> > > > > >
> > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > assembly. Otherwise it'd require more changes.
> > >
> > > It would be good to understand what these changes are.
> > >
> > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > if you set with option "2"?
> > >
> > > I can see your point if you think about the feature as a whole.
> > > However, distributions (such as Android) have specialized knowledge of
> > > their security environments, and may not want to support the typical
> > > usages of userfaultfd. For such distributions, providing a mechanism
> > > to prevent userfaultfd from being useful as an exploit primitive,
> > > while still allowing the very limited use of userfaultfd for userspace
> > > faults only, is desirable. Distributions shouldn't be forced into
> > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > needs may be more specialized, and this sysctl knob empowers
> > > distributions to make this choice for themselves.
> > >
> > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > nothing else that could break?
> > >
> > > This is a great example of a capability which a distribution may not
> > > want to support, due to distribution specific security policies.
> > >
> > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > >
> > > Can you elaborate on what bug is present by processes being
> > > unprotected by seccomp?
> > >
> > > Seccomp cannot be universally applied on Android due to previously
> > > mentioned performance concerns. Seccomp is used in Android primarily
> > > as a tool to enforce the list of allowed syscalls, so that such
> > > syscalls can be audited before being included as part of the Android
> > > API.
> > >
> > > -- Nick
> > >
> > > --
> > > Nick Kralevich | nnk@xxxxxxxxxx
> >
>
> --
> Nick Kralevich | nnk@xxxxxxxxxx