Hello! On Sat, Aug 01, 2020 at 10:39:00AM -0700, Linus Torvalds wrote: > On Sat, Aug 1, 2020 at 8:30 AM Tetsuo Handa > <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote: > > > > Waiting for response at https://lkml.kernel.org/r/45a9b2c8-d0b7-8f00-5b30-0cfe3e028b28@xxxxxxxxxxxxxxxxxxx . > > I think handle_userfault() should have a (shortish) timeout, and just > return VM_FAULT_RETRY. The 1sec timeout if applied only to kernel faults (not the case yet but it'd be enough to solve the hangcheck timer), will work perfectly for Android, but it will break qemu. [ 916.954313] INFO: task syz-executor.0:61593 blocked for more than 40 seconds. If you want to enforce a timeout, 40 seconds or something of the order of the hangcheck timer would be more reasonable. 1sec is of the same order of magnitude of latency that you'd get with an host kernel upgrade in place with kexec (with the guest memory being preserved in RAM) that you'd suffer occasionally from in most public clouds. So postcopy live migration should be allowed to take 1 sec latency and it shouldn't become a deal breaker, that results in the VM getting killed. > The code is overly complex anyway, because it predates the "just return RETRY". > > And because we can't wait forever when the source of the fault is a > kernel exception, I think we should add some extra logic to just say > "if this is a retry, we've already done this once, just return an > error". Until the uffp-wp was merged recently, we never needed more than one VM_FAULT_RETRY to handle uffd-missing faults, you seem to want to go back to that which again would be fine for uffd-missing faults. I haven't had time to read and test the testcase properly yet, but at first glance from reading the hangcheck report it looks like there would be just one userfault? So I don't see an immediate connection. The change adding a 1sec timeout would definitely fix this issue, but it'll also break qemu and probably the vast majority of the users. > This is a TEST PATCH ONLY. I think we'll actually have to do something > like this, but I think the final version might need to allow a couple > of retries, rather than just give up after just one second. > > But for testing your case, this patch might be enough to at least show > that "yeah, this kind of approach works". > > Andrea? Comments? As mentioned, this is probably much too aggressive, > but I do think we need to limit the time that the kernel will wait for > page faults. Why is pipe preventing to SIGKILL the task that is blocked on the mutex_lock? Is there any good reason for it or it simply has margin for improvement regardless of the hangcheck report? It'd be great if we can look into that before looking into the uffd specific bits. The hangcheck timer would have zero issues with tasks that can be killed, if only the pipe code could be improved to use mutex_lock_killable. /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */ if (t->state == TASK_UNINTERRUPTIBLE) check_hung_task(t, timeout); The hangcheck report is just telling us one task was in D state a little too long, but it wasn't fatal error and the kernel wasn't actually destabilized and the only malfunction reported is that a task was unkillable for too long. Now if it's impossible to improve the pipe code so it works better not just for uffd, there's still no reason to worry: we could disable uffd in the pipe context. For example ptrace opts-out of uffds, so that gdb doesn't get stuck if you read a pointer that should be handled by the process that is under debug. I hope it won't be necessary but it wouldn't be a major issue, certainly it wouldn't risk breaking qemu (and non-cooperative APIs are privileged so it could still skip the timeout). > Because userfaultfd has become a huge source of security holes as a > way to time kernel faults or delay them indefinitely. I assume you refer to the below: https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit https://blog.lizzie.io/using-userfaultfd.html These reports don't happen to mention CONFIG_SLAB_FREELIST_RANDOM=y which is already enabled by default in enterprise kernels for a reason and they don't mention the more recent CONFIG_SLAB_FREELIST_HARDENED and CONFIG_SHUFFLE_PAGE_ALLOCATOR. Can they test it with those options enabled again, does it still work so good or not anymore? That would be very helpful to know. Randomizing which is the next page that gets allocated is much more important than worrying about uffd because if you removed uffd you may still have other ways to temporarily stop the page fault depending on the setup. Example: https://bugs.chromium.org/p/project-zero/issues/detail?id=808 The above one doesn't use uffd, but it uses fuse. So is fuse also a source of security holes given they even use it for the exploit in a preferential way instead of uffd? "This can be done by abusing the writev() syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays read accesses, then mmap()s a file containing a struct iovec from that FUSE filesystem and passes the result of mmap() to writev(). (Another way to do this would be to use the userfaultfd() syscall.)" It's not just uffd and fuse: even if you set /proc/sys/vm/unprivileged_userfaultfd to 0 and you drop fuse, swapin and pagein from NFS can be attacked through the network. The fact there's a timeout won't make those less useful. Enlarging the race window to 1sec or 40sec won't help much. What would be useful if something is to add a delay to the wakeup event, not to add a timeout, but even if we do that, the bug may still be reproducible and it may eventually win the race regardless, by just waiting longer. It's impossible to obtain maximum features, maximum security and maximum performance all at the same time, something has to give in and we clearly want to support the enhanced-robustness secure setup where all other unprivileged_* sysctl are tweaked too (not just uffd), and we already do with the sysctl we added for it. In this respect uffd is not different from other features that also shouldn't be accessible without privilege in those setup. It's part of the tradeoff. Most important the default container runtime seccomp filters blocks uffd so all containers are at zero risk unless they use uffd actively and opt-in explicitly using the OCI schema seccomp filter, in which case the previous sentence applies. Anybody running a secure setup but not wrapping everything under at least the default seccomp filter of the container runtime, is not really secure anyway so even the sysctl is meaningless in reality. Way more useful than the sysctl in practice is that container runtime needs an hard denylist/allowlist that cannot be opted out of the OCI schema and in paranoid setups some syscalls could be added to it, despite it may break stuff. Randomizing the allocations so the timing don't matter anymore is most certainly worth it because it will work not just for uffd, but also for fuse, swapin from nfs under malicious network flood and any other source of page fault controllable stalls. The last complaint received on the uffd security topic was the suggestion that uffd being allowed to block kernel faults was a concern for lockdown confidentiality mode. I answered to that here and I repeated some part in the above as well: https://lkml.kernel.org/r/20200520040608.GB26186@xxxxxxxxxx (if no time to read the above: the short version is that lockdown confidentiality is not a robustness/probabilistic feature. It's a black and white feature and such it shouldn't even try to be robust against kernel bugs by design. If we turn confidentiality mode in an "hardening" kernel feature, then the sky is the limit and it'll become a growing black hole that will drop more and more and it won't be possible to draw a sure line where to stop dropping, until what's left will be too little or too slow to be useful) Thanks, Andrea (On the seccomp topic, and this would open a new huge thread, absolutely we need to change the default of spec_store_bypass_disable and spectre_v2_user to prctl, we can't keep it to seccomp, most userland including the container runtimes already opted-out with SECCOMP_FILTER_FLAG_SPEC_ALLOW, the longer we wait the more it'll be pointless to even leave a =seccomp option as opt-in later, I'll try to work on submitting something soon to fix this)