Hello,

On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote:
> On 13/03/19 07:00, Peter Xu wrote:
> >> However, I can imagine more special cases being added for other users. And,
> >> once you have more than one special case then you may want to combine them.
> >> For example, kvm and hugetlbfs together.
> > It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon
> > mm_struct, since that seems to be a very general flag that can be used
> > by anything we want to grant privilege for, not only KVM?
>
> Perhaps you can remove the fork() limitation, and add a new suboption to
> prctl(PR_SET_MM) that sets/resets MMF_USERFAULTFD_ALLOW. If somebody
> wants to forbid unprivileged userfaultfd and use KVM, they'll have to
> use libvirt or some other privileged management tool.
>
> We could also add support for this prctl to systemd, and then one could
> do "systemd-run -pAllowUserfaultfd=yes COMMAND".

systemd can already implement -pAllowUserfaultfd=no with seccomp if it
wants. It can also implement -yes if by default it turns off
userfaultfd, like firejail -seccomp would do.

If the end goal is to implement the filtering with a userland policy
instead of a kernel policy, seccomp enabled for all services sounds
reasonable. It's very unlikely you'll block only userfaultfd: firejail
-seccomp by default blocks dozens of syscalls that are unnecessary
99.9% of the time.

This is not about implementing a flexible userland policy. It's just a
simple kernel policy, to use until userland disables the kernel policy
and takes over with seccomp across the board. I wouldn't like this to
be too complicated, because it already theoretically overlaps 100%
with seccomp.

hugetlbfs is more complicated to detect, because even if you inherit
it from fork(), the service that mounts the fs may be in a different
container than the one where Oracle uses userfaultfd later on down the
road, from a different context.
And I don't think it would be ok to allow running userfaultfd just
because you can open a file in a hugetlbfs file system. With /dev/kvm
it's a bit different: that's chmod o-r by default, so no luser should
be able to open it.

Unless somebody suggests a consistent way to make hugetlbfs "just
work" (like we could achieve cleanly with CRIU and KVM), I think Oracle
will need a one-liner change in the Oracle setup to echo into that
file in addition to running the hugetlbfs mount.

Note that the DPDK host bridge process will also need a one-liner
change to do a dummy open/close of /dev/kvm to unblock the syscall.

Thanks,
Andrea