Hello Jonathan and everyone,

On Thu, May 07, 2020 at 01:15:03PM -0600, Jonathan Corbet wrote:
> On Wed, 6 May 2020 15:38:16 -0400
> Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> > If this is going to be added... I am thinking whether it should be easier to
> > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> >
> >   "0": unprivileged userfaultfd forbidden
> >   "1": unprivileged userfaultfd allowed (both user/kernel faults)
> >   "2": unprivileged userfaultfd allowed (only user faults)
> >
> > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> > (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> > as before ("1")
>
> It occurs to me to wonder whether this interface should also let an admin
> block *privileged* user from handling kernel-space faults?  In a
> secure-boot/lockdown setting, this could be a hardening measure that keeps
> a (somewhat) restricted root user from expanding their privilege...?

That's a good question.

In my view, if as root in lockdown mode you can still run the swapon
syscall, set up nfs or other network devices, and load userland fuse
filesystems or cuse chardevs, then even if you prevent userfaultfd from
blocking kernel faults, kernel faults can still be blocked by other
means.

That in fact tends to be true also as non-root (so regardless of
lockdown settings), since luser can generally load fuse filesystems.

There is no fundamental integrity breakage or privilege escalation
originating in userfaultfd.

The only concern here is this: "after a new use-after-free is
discovered in some other part of the kernel (not related to
userfaultfd), how easy is it to turn the use-after-free from a mere DoS
into a more concerning privilege escalation?".

userfaultfd might facilitate the exploitation, but even if you remove
userfaultfd from the equation, there's still no guarantee a
use-after-free won't materialize as a privilege escalation by other
means.
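To make the three values of the quoted proposal concrete, here is a
minimal sketch in Python. It only models the proposed policy and is not
kernel code: the enum, the function and its parameters are hypothetical
names introduced for illustration, and "privileged" stands in for
whatever capability check the kernel would apply.

```python
from enum import IntEnum


class UnprivilegedUffd(IntEnum):
    """Hypothetical names for the three proposed sysctl values."""
    FORBIDDEN = 0         # unprivileged userfaultfd forbidden
    ALLOWED = 1           # allowed, both user and kernel faults (default)
    USER_FAULTS_ONLY = 2  # allowed, only user faults


def uffd_may_handle_fault(sysctl: int, privileged: bool, kernel_fault: bool) -> bool:
    """Model whether a task's userfaultfd would get to handle a fault."""
    mode = UnprivilegedUffd(sysctl)
    if privileged:
        # The quoted proposal only restricts unprivileged users.
        return True
    if mode is UnprivilegedUffd.FORBIDDEN:
        return False
    if mode is UnprivilegedUffd.USER_FAULTS_ONLY:
        # Faults taken while running in kernel mode are not handed to uffd.
        return not kernel_fault
    return True
```

With this model, the only case that differs between "1" and "2" is an
unprivileged task whose fault is taken in kernel mode, which is the
case the rest of this mail is about.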
So to express it in another way: unless lockdown (no matter in which
mode) is a weak, probabilistic feature that cannot provide any
guarantee to begin with, the userfaultfd sysctl set to 0|1|2 can't
possibly make any difference to it.

The best mitigation for this kind of exploit remains to randomize all
kernel memory allocations. Then, even if the attacker can block the
fault, when it's unblocked the kernel will pick another page, not the
one the attacker predicted, so the attacker needs to repeat the race
many more times and hopefully it'll DoS and destabilize the kernel
before it can reproduce a privilege escalation.

We got many of those randomization features in the current kernel and
it's probably more important to enable those than to worry about this
sysctl value.

One way to have peace of mind against all use-after-free bugs,
regardless of this sysctl value, is to run each pod in a KVM instance:
that's safer than disabling syscalls or kernel features.

The default seccomp profiles of podman already block userfaultfd too,
so there's no need for virt to get extra safety if you use containers:
containers need to explicitly opt in to enable userfaultfd through the
OCI schema seccomp object.

If userfaultfd is being explicitly whitelisted in the OCI schema of the
container, well then you know there is a good reason for it. As a
matter of fact some things are only possible to achieve with
userfaultfd fully enabled.

The big value uffd brings compared to trapping sigsegv is precisely to
be able to handle kernel faults transparently. sigsegv can't do that
because every syscall would return 1) an inconsistent retval and 2) no
fault address along with the retval.
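As an illustration of the explicit container opt-in mentioned above,
here is a minimal sketch of what such a seccomp object could look like,
written as Python so it can be checked. The field names (defaultAction,
syscalls, names, action) and the linux.seccomp location follow the OCI
runtime spec; the profile itself is a hypothetical, drastically
shortened one (real profiles list hundreds of syscalls), and
allows_syscall() is a hypothetical helper that ignores args and
architecture filters.

```python
import json

# Hypothetical default-deny profile that explicitly allows userfaultfd.
seccomp = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["userfaultfd"], "action": "SCMP_ACT_ALLOW"},
    ],
}

# Where the object would sit inside an OCI config.json.
oci_fragment = {"linux": {"seccomp": seccomp}}


def allows_syscall(profile: dict, name: str) -> bool:
    """Simplified check: first matching rule wins, else the default action."""
    for rule in profile.get("syscalls", []):
        if name in rule.get("names", []):
            return rule["action"] == "SCMP_ACT_ALLOW"
    return profile.get("defaultAction") == "SCMP_ACT_ALLOW"


if __name__ == "__main__":
    print(json.dumps(oci_fragment, indent=2))
```

Anything not listed falls back to the default action, so a container
only gets userfaultfd if somebody wrote that rule on purpose.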
The possible future uffd userland users could be: dropping the JVM
dirty bit, redis snapshots using pthread_create() instead of fork(),
distributed shared memory on pmem, a new malloc() implementation that
never takes mmap_sem for writing in the kernel and never modifies any
vma to allocate and free anon memory, etc. I don't think any of them
would work with the sysctl set to "2".

The next kernel feature in uffd land that I was discussing with Peter
is an async uffd event model to further optimize the replacement of
soft-dirty (which uffd already provides in O(1) instead of O(N)), so
the wrprotect fault won't have to block anymore until the uffd async
queue overflows. That also is unlikely to work with the sysctl set to
"2" without adding extra constraints that soft-dirty doesn't currently
have.

It would also be possible to implement the value "2" to work like
/proc/sys/kernel/unprivileged_bpf_disabled, so when you set it to "1"
as root, you can't set it to "2" or "0", and when you set it to "2" you
can't set it to "0", but personally I think it's unnecessary.

Thanks,
Andrea