Hello, On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote: > There has been an emphasis that Android is probably the only user for > the restriction of userfaults from kernel-space and that it wouldn’t > be useful anywhere else. I humbly disagree! There are various areas > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely > user-space setting. Basically, any lazy, on-demand, For the record what I said is quoted below https://lkml.kernel.org/r/20200520194804.GJ26186@xxxxxxxxxx : """It all boils down of how peculiar it is to be able to leverage only the acceleration [..] Right now there's a single user that can cope with that limitation [..] If there will be more users [..] it'd be fine to add a value "2" later.""" Specifically I never said "that it wouldn’t be useful anywhere else.". Also I'm only arguing about the sysctl visible kABI change in patch 2/2: the flag passed as parameter to the syscall in patch 1/2 is all great, because seccomp needs it in the scalar parameter of the syscall to implement a filter equivalent to your sysctl "2" policy with only patch 1/2 applied. I've two more questions now: 1) why don't you enforce the block of kernel initiated faults with seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just an optimization to remove a few instructions per syscall in the bpf execution of Android unprivileged apps? You should block a lot of other syscalls by default to all unprivileged processes, including vmsplice. In other words if it's just for Android, why can't Android solve it with only patch 1/2 by tweaking the seccomp filter? 2) given that Android is secure enough with the sysctl at value 2, why should we even retain the current sysctl 0 semantics? Why can't more secure systems just use seccomp and block userfaultfd, as it is already happens by default in the podman default seccomp whitelist (for those containers that don't define a new json whitelist in the OCI schema)? Shouldn't we focus our energy in making containers more secure by preventing the OCI schema of a random container to re-enable userfaultfd in the container seccomp filter instead of trying to solve this with a global sysctl? What's missing in my view is a kubernetes hard allowlist/denylist that cannot be overridden with the OCI schema in case people has the bad idea of running containers downloaded from a not fully trusted source, without adding virt isolation and that's an userland problem to be solved in the container runtime, not a kernel issue. Then you'd just add userfaultfd to the json of the k8s hard seccomp denylist instead of going around tweaking sysctl. What's your take in changing your 2/2 patch to just replace value "0" and avoid introducing a new value "2"? The value "0" was motivated by the concern that uffd can enlarge the race window for use after free by providing one more additional way to block kernel faults, but value "2" is already enough to solve that concern completely and it'll be the default on all Android. In other words by adding "2" you're effectively doing a more finegrined and more optimal implementation of "0" that remains useful and available to unprivileged apps and it already resolves all "robustness against side effects other kernel bugs" concerns. Clearly "0" is even more secure statistically but that would apply to every other syscall including vmsplice, and there's no /proc/sys/vm/unprivileged_vmsplice sysctl out there. The next issue we have now is with the pipe mutex (which is not a major concern but we need to solve it somehow for correctness). So I wonder if should make the default value to be "0" (or "2" if think we should not replace "0") and to allow only user initiated faults by default. Changing the sysctl default to be 0, will make live migration fail to switch to postcopy which will be (unnoticeable to the guest), instead of risking the VM to be killed because of network latency outlier. Then we wouldn't need to change the pipe code at all. Alternatively we could still fix the pipe code so it runs better (but it'll be more complex) or to disable uffd faults only in the pipe code. One thing to keep in mind is that if we change the default, then hypervisor hosts running QEMU would need to set: vm.userfaultfd = 1 in /etc/sysctl.conf if postcopy live migration is required, that's not particularly concerning constraint for qemu (there are likely other tweaks required and it looks less risky than an arbitrary timeout which could kill the VM: if the above is forgotten the postcopy live migration won't even start and it'll be unnoticeable to the guest). The main concern really are future apps that may want to use uffd for kernel initiated faults won't be allowed to do so by default anymore, those apps will be heavily incentivated to use bounce buffers before passing data to syscalls, similarly to the current use case of patch 2/2. Comments welcome, Andrea PS. Another usage of uffd that remains possible without privilege with the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib will remain possible without requiring extra privileges, by adding a UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE but it will solve the last performance bottleneck by avoiding a wrprotect fault after the allocation and it will be THP capable too). Memory will be freed with MADV_DONTNEED, without ever having to call mmap/mumap. It could move memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP which already exists.