Hello Lokesh, On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote: > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@xxxxxxxxxx> wrote: > > > > This patch series is split from [1]. The other series enables SELinux > > support for userfaultfd file descriptors so that its creation and > > movement can be controlled. > > > > It has been demonstrated on various occasions that suspending kernel > > code execution for an arbitrary amount of time at any access to > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited > > to change the intended behavior of the kernel. For instance, handling > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been > > exploited in [4, 5] for similar outcome. > > > > This small patch series adds a new flag to userfaultfd(2) that allows > > callers to give up the ability to handle kernel-mode faults with the > > resulting UFFD file object. It then adds a 'user-mode only' option to > > the unprivileged_userfaultfd sysctl knob to require unprivileged > > callers to use this new flag. > > > > The purpose of this new interface is to decrease the chance of an > > unprivileged userfaultfd user taking advantage of userfaultfd to > > enhance security vulnerabilities by lengthening the race window in > > kernel code. > > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@xxxxxxxxxx/ > > [2] https://duasynt.com/blog/linux-kernel-heap-spray > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit I've looking at those links and I've been trying to verify the link [3] is relevant. Specifically I've been trying to verify if 1) current state of the art modern SLUB randomization techniques already enabled in production and rightfully wasting some CPU in all enterprise kernels to prevent things like above to become an issue in practice 2) combined with the fact different memcg need to share the same kmemcaches (which was incidentally fixed a few months ago upstream) and 3) further robustness enhancements against exploits in the slub metadata, may already render the exploit [3] from 2016 irrelevant in practice. So I started by trying to reproduce [3] by building 4.5.1 with a .config with no robustness features and I booted it on fedora-32 or gentoo userland and I cannot even invoke call_usermodehelper. Calling socket(22, AF_INET, 0) won't invoke such function. Can you reproduce on 4.5.1? Which kernel .config should I use to build 4.5.1 in order for call_usermodehelper to be invoked by the exploit? Could you help to verify it? It even has uninitialized variable spawning random perrors so it doesn't give a warm fuzzy feeling: ==== int main(int argc, char **argv) { void *region, *map; ^^^^^ pthread_t uffd_thread; int uffd, msqid, i; region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0); if (!region) { perror("mmap"); exit(2); } setup_pagefault(region + 0x1000, 0x1000, 1); printf("my pid = %d\n", getpid()); if (!map) { ^^^^^^^^ perror("mmap"); ==== The whole point of being able to reproduce on 4.5.1 is then to simulate if the same exploit would also reproduce on current kernels with all enterprise default robustness features enabled. Or did anybody already verify it? Anyway the links I was playing with are all in the cover letter, the cover letter is not as important as the actual patches. The actual patches looks fine to me. The only improvement I can think of is, what about to add a printk_once to suggest to toggle the sysctl if userfaultfd bails out because the process lacks the CAP_SYS_PTRACE capability? That would facilitate the /etc/sysctl.conf or tuned tweaking in case the apps aren't verbose enough. It's not relevant anymore with this latest patchset, but about the previous argument that seccomp couldn't be used in all Android processes because of performance concern, I'm slightly confused. https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html "Android O includes a single seccomp filter installed into zygote, the process from which all the Android applications are derived. Because the filter is installed into zygote—and therefore all apps—the Android security team took extra caution to not break existing apps" Example: $ uname -mo aarch64 Android $ cat swapoff.c #include <sys/swap.h> int main() { swapoff(""); } $ gcc swapoff.c -o swapoff -O2 $ ./swapoff Bad system call $ It's hard to imagine what is more performance critical than the zygote process and the actual apps as above? It's also hard to imagine what kind of performance concern can arise by adding seccomp filters also to background system apps that generally should consume ~0% of CPU. If performance is really a concern, the BPF JIT representation with the bitmap to be able to run the filter in O(1) sounds a better solution than not adding ad-hoc filters and it's being worked on for x86-64 and can be ported to aarch64 too. Many of the standalone background processes likely wouldn't even use uffd at all so you could block the user initiated faults too that way. Ultimately because of issues as [3] (be them still relevant or not, to be double checked), no matter if through selinux, seccomp or a different sysctl value, without this patchset applied the default behavior of the userfaultfd syscall for all Linux binaries running on Android kernels, would deviate from the upstream kernel. So even if we would make the pipe mutex logic more complex the deviation would remain. Your patchset adds much less risk of breakage than adding a timeout to kernel initiated userfaults and it resolves all concerns as well as a timeout. We'll also make better use of the "0" value this way. So while I'm not certain this is the best for the long term, this looks the sweet spot for the short term to resolve many issues at once. Thanks! Andrea