Hello Kees, On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote: > Err, did I miss a separate 6-patch series? I can't find anything on lore. Daniel included the link of the previous series I referred to is the cover letter 0/2: https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@xxxxxxxxxx/ > > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be > > always called with flags==0x1 without requiring extra modifications in > > the kernel? > > Please no. This is way too much overhead for something that a system > owner wants to enforce globally. A sysctl is the correct option here, > IMO. If it needs to be a per-userns sysctl, that would be fine too. The question is who could be this system owner who prefers "2" to "0"? per-ns I don't see the point either when all containers already run with default policies enforcing the same behavior as if the sysctl is set to "0". Why exactly is it preferable to enlarge the surface of attack of the kernel and take the risk there is a real bug in userfaultfd code (not just a facilitation of exploiting some other kernel bug) that leads to a privilege escalation, when you still break 99% of userfaultfd users, if you set with option "2"? Is the system owner really going to purely run on his systems CRIU postcopy live migration (which already runs with CAP_SYS_PTRACE) and nothing else that could break? Option "2" to me looks with a single possible user, and incidentally this single user can already enforce model "2" by only tweaking its seccomp-bpf filters without applying 2/2. It'd be a bug if android apps runs unprotected by seccomp regardless of 2/2. System owners I think would be better of to stick to "0" or "1" a far as I can tell, "2" looks a bad tradeoff as system value, with nearly all cons of "0", but less secure than "0". > I'd agree that patch 1 should land, as it appears to be required for any Agreed about merging 1/2. > further policy considerations. I'm still a big fan of a sysctl since > this is the kind of thing I would absolutely turn on globally for all my > systems. The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there upstream and you should have already set it to "0" in all your systems if you can cope with some app features not working. 2/2 as modified with Peter's suggestion, would add a new value "2" (in addition of "1" and "0"), that still breaks the majority of all possible users, just like value "0", but that gives less security than value "0". It all boils down of how peculiar it is to be able to leverage only the acceleration (reduction in context switches and enter/exit kernel) and the vma fragmentation avoidance (to avoid running of vmas in /proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv trapping) but not the transparency in handling faults in kernel (which sigsegv can't possibly achieve). Right now there's a single user that can cope with that limitation, and it's not your running on your servers but on your phone. Even CRIU cannot cope with such limitation, the only reason it would cope with value "2" is that it already runs with CAP_SYS_PTRACE for other reasons. If there will be more users that can cope with handling only user initiated page faults, then yes, I think it'd be fine to add a value "2" later. The other benefit of enforcing the policy with seccomp-bpf if that if you'd run Android userland on a container on a ARM server on top of an enterprise kernel, it'd already run as safe as if the sysctl value was tweaked to "2", but without having to also add extra kernel code for per-ns sysctl. It just looks simpler. The rest of the containers on the same host are already running today as if the sysctl is set to 0 regardless of this patchset. If you want to enforce maximum security and override any possible opt-out of the default podman seccomp profile that blocks userfaultfd, you already can upstream with the global sysctl by setting it to "0". Thanks, Andrea