On Tue, Jul 19, 2022 at 12:56:25PM -0700, Axel Rasmussen wrote:
> Historically, it has been shown that intercepting kernel faults with
> userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
> of time) can be exploited, or at least can make some kinds of exploits
> easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
> changed things so, in order for kernel faults to be handled by
> userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl
> must be configured so that any unprivileged user can do it.
>
> In a typical implementation of a hypervisor with live migration (take
> QEMU/KVM as one such example), we do indeed need to be able to handle
> kernel faults. But, both options above are less than ideal:
>
> - Toggling the sysctl increases attack surface by allowing any
>   unprivileged user to do it.
>
> - Granting the live migration process CAP_SYS_PTRACE gives it this
>   ability, but *also* the ability to "observe and control the
>   execution of another process [...], and examine and change [its]
>   memory and registers" (from ptrace(2)). This isn't something we need
>   or want to be able to do, so granting this permission violates the
>   "principle of least privilege".
>
> This is all a long winded way to say: we want a more fine-grained way to
> grant access to userfaultfd, without granting other additional
> permissions at the same time.
>
> To achieve this, add a /dev/userfaultfd misc device. This device
> provides an alternative to the userfaultfd(2) syscall for the creation
> of new userfaultfds. The idea is, any userfaultfds created this way will
> be able to handle kernel faults, without the caller having any special
> capabilities. Access to this mechanism is instead restricted using e.g.
> standard filesystem permissions.
>
> Signed-off-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>

Thanks, this looks much better.

Acked-by: Peter Xu <peterx@xxxxxxxxxx>

-- 
Peter Xu