On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote: > > Hello, > > On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote: > > There has been an emphasis that Android is probably the only user for > > the restriction of userfaults from kernel-space and that it wouldn’t > > be useful anywhere else. I humbly disagree! There are various areas > > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely > > user-space setting. Basically, any lazy, on-demand, > > For the record what I said is quoted below > https://lkml.kernel.org/r/20200520194804.GJ26186@xxxxxxxxxx : > > """It all boils down of how peculiar it is to be able to leverage only > the acceleration [..] Right now there's a single user that can cope > with that limitation [..] If there will be more users [..] it'd be > fine to add a value "2" later.""" > > Specifically I never said "that it wouldn’t be useful anywhere else.". > Thanks a lot for clarifying. > Also I'm only arguing about the sysctl visible kABI change in patch > 2/2: the flag passed as parameter to the syscall in patch 1/2 is all > great, because seccomp needs it in the scalar parameter of the syscall > to implement a filter equivalent to your sysctl "2" policy with only > patch 1/2 applied. > > I've two more questions now: > > 1) why don't you enforce the block of kernel initiated faults with > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just > an optimization to remove a few instructions per syscall in the bpf > execution of Android unprivileged apps? You should block a lot of > other syscalls by default to all unprivileged processes, including > vmsplice. > > In other words if it's just for Android, why can't Android solve it > with only patch 1/2 by tweaking the seccomp filter? I would let Nick (nnk@) and Jeff (jeffv@) respond to this. The previous responses from both of them on this email thread (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@xxxxxxxxxxxxxx/ and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@xxxxxxxxxxxxxx/) suggest that the performance overhead of seccomp-bpf is too much. Kees also objected to it (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/) I'm not familiar with how seccomp-bpf works. All that I can add here is that userfaultfd syscall is usually not invoked in a performance critical code path. So, if the performance overhead of seccomp-bpf (if enabled) is observed on all syscalls originating from a process, then I'd say patch 2/2 is essential. Otherwise, it should be ok to let seccomp perform the same functionality instead. > > 2) given that Android is secure enough with the sysctl at value 2, why > should we even retain the current sysctl 0 semantics? Why can't > more secure systems just use seccomp and block userfaultfd, as it > is already happens by default in the podman default seccomp > whitelist (for those containers that don't define a new json > whitelist in the OCI schema)? Shouldn't we focus our energy in > making containers more secure by preventing the OCI schema of a > random container to re-enable userfaultfd in the container seccomp > filter instead of trying to solve this with a global sysctl? > > What's missing in my view is a kubernetes hard allowlist/denylist > that cannot be overridden with the OCI schema in case people has > the bad idea of running containers downloaded from a not fully > trusted source, without adding virt isolation and that's an > userland problem to be solved in the container runtime, not a > kernel issue. Then you'd just add userfaultfd to the json of the > k8s hard seccomp denylist instead of going around tweaking sysctl. > > What's your take in changing your 2/2 patch to just replace value "0" > and avoid introducing a new value "2"? SGTM. Disabling uffd completely for unprivileged processes can be achieved either using seccomp-bpf, or via SELinux, once the following patch series is upstreamed https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@xxxxxxxxxx/ > > The value "0" was motivated by the concern that uffd can enlarge the > race window for use after free by providing one more additional way to > block kernel faults, but value "2" is already enough to solve that > concern completely and it'll be the default on all Android. > > In other words by adding "2" you're effectively doing a more > finegrined and more optimal implementation of "0" that remains useful > and available to unprivileged apps and it already resolves all > "robustness against side effects other kernel bugs" concerns. Clearly > "0" is even more secure statistically but that would apply to every > other syscall including vmsplice, and there's no > /proc/sys/vm/unprivileged_vmsplice sysctl out there. > > The next issue we have now is with the pipe mutex (which is not a > major concern but we need to solve it somehow for correctness). So I > wonder if should make the default value to be "0" (or "2" if think we > should not replace "0") and to allow only user initiated faults by > default. > > Changing the sysctl default to be 0, will make live migration fail to > switch to postcopy which will be (unnoticeable to the guest), instead > of risking the VM to be killed because of network latency > outlier. Then we wouldn't need to change the pipe code at all. > SGTM. I can change the default value to '0' (or '2') in the next revision of patch 2/2, unless somebody objects to this. > Alternatively we could still fix the pipe code so it runs better (but > it'll be more complex) or to disable uffd faults only in the pipe > code. > > One thing to keep in mind is that if we change the default, then > hypervisor hosts running QEMU would need to set: > > vm.userfaultfd = 1 > > in /etc/sysctl.conf if postcopy live migration is required, that's not > particularly concerning constraint for qemu (there are likely other > tweaks required and it looks less risky than an arbitrary timeout > which could kill the VM: if the above is forgotten the postcopy live > migration won't even start and it'll be unnoticeable to the guest). > > The main concern really are future apps that may want to use uffd for > kernel initiated faults won't be allowed to do so by default anymore, > those apps will be heavily incentivated to use bounce buffers before > passing data to syscalls, similarly to the current use case of patch 2/2. > > Comments welcome, > Andrea > > PS. Another usage of uffd that remains possible without privilege with > the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV > acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib > will remain possible without requiring extra privileges, by adding a > UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS > (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be > indistinguishable to UFFDIO_ZEROPAGE but it will solve the last > performance bottleneck by avoiding a wrprotect fault after the > allocation and it will be THP capable too). Memory will be freed with > MADV_DONTNEED, without ever having to call mmap/mumap. It could move > memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP > which already exists. > UFFDIO_POPULATE sounds like a really useful feature. I don't see it in the kernel yet. Is there a patch under work on this? If so, kindly share.