On Wed, Oct 23, 2019 at 12:21:18PM -0700, Andy Lutomirski wrote:
> There are two things going on here.
>
> 1. Daniel wants to add LSM labels to userfaultfd objects. This seems
> reasonable to me. The question, as I understand it, is: who is the
> subject that creates a uffd referring to a forked child? I'm sure
> this is solvable in any number of straightforward ways, but I think
> it's less important than:

The new uffd created during fork would definitely need to be accounted
to the criu monitor, not to the parent or the child, so it'd need to
be accounted to the process/context that has the fd in its file
descriptors array. But since this is less important let's ignore this
for a second.

> 2. The existing ABI is busted independently of #1. Suppose you call
> userfaultfd to get a userfaultfd and enable UFFD_FEATURE_EVENT_FORK.
> Then you do:
>
> $ sudo <&[userfaultfd number]
>
> Sudo will read it and get a new fd unexpectedly added to its fd table.
> It's worse if SCM_RIGHTS is involved.

So the problem is just that a new fd is created. So for this to turn
out to be a practical issue, it requires finding a reckless suid that
won't even bother checking the return value of the open/socket
syscalls or some equivalent fd number related side effect. All right,
that makes more sense now and of course I agree it needs fixing.

> So I think we either need to declare that UFFD_FEATURE_EVENT_FORK is
> only usable by global root or we need to remove it and maybe re-add it
> in some other form.

If I had a time machine, I'd prefer to do the below:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index fe6d804a38dc..574062051678 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1958,7 +1958,7 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
 		return -ENOMEM;
 
 	refcount_set(&ctx->refcount, 1);
-	ctx->flags = flags;
+	ctx->flags = flags | UFFD_CLOEXEC;
 	ctx->features = 0;
 	ctx->state = UFFD_STATE_WAIT_API;
 	ctx->released = false;

I mean there's no strong requirement to allow any uffd to survive exec
even if UFFD_FEATURE_EVENT_FORK never existed, it's enough if it can
be passed through unix domain sockets.

Until UFFD_FEATURE_EVENT_FORK came around, there was no particular
reason to implicitly enforce O_CLOEXEC on all uffd, it was totally
possible to clone() and exec() to pass the fd to a different process.
So it never rang a bell that this would turn out to be a problem after
UFFD_FEATURE_EVENT_FORK was introduced.

There are various ways to approach this:

1) drop all non cooperative features and mark their feature bitflags
   reserved (no ABI break)

2) enforce UFFD_CLOEXEC with above patch (potential ABI break for all
   userfaultfd users)

3) enforce UFFD_CLOEXEC if UFFD_FEATURE_EVENT_FORK is set (ABI break
   only if UFFD_FEATURE_EVENT_FORK is set). Note all forked uffd are
   opened with the same flags inherited from the original uffd.

4) enforce the global root permission check when creating the uffd
   only if UFFD_FEATURE_EVENT_FORK is set.

5) drop all non cooperative features from API 0xaa and introduce API
   0xab with the features back, but with UFFD_CLOEXEC implicitly
   enforced and with UFFD_CLOEXEC forbidden to be set in the flags

6) stick to API 0xaa and drop only UFFD_FEATURE_EVENT_FORK, but add a
   UFFD_FEATURE_EVENT_FORK2 that requires UFFD_CLOEXEC to be set
   (instead of implicitly enforcing it)

7) stick to API 0xaa and drop only UFFD_FEATURE_EVENT_FORK, but add a
   UFFD_FEATURE_EVENT_FORK2 that does the global root permission check

5 is the non-ABI-break version of 2.
6 is the non-ABI-break version of 3.
7 is the non-ABI-break version of 4.

My favorite is 1) for the reason explained in the previous email.

However if postcopy live migration of bare metal containers already
runs in production anywhere or is at least very close to reaching that
milestone, or if the non-cooperative features are used in production
in any other way, we'd like to know where, and in such case that will
totally change my mind about it. In such case I'd suggest picking any
of the other options.

In short there should be a good reason for going through further
maintenance burden.

Thanks,
Andrea
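
For reference, options 2, 3 and 6 all come down to the uffd being
close-on-exec. Below is a minimal userspace sketch of what that looks
like from a monitor's side, under the assumption that the monitor
passes O_CLOEXEC itself (UFFD_CLOEXEC is defined as O_CLOEXEC in the
kernel) and re-checks any fd it obtains later. The helper names
fd_is_cloexec(), fd_force_cloexec() and uffd_open_cloexec() are made
up for illustration and are not part of any existing API.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd is close-on-exec, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;
	return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: set FD_CLOEXEC, keeping the other fd flags. */
int fd_force_cloexec(int fd)
{
	int fdflags = fcntl(fd, F_GETFD);

	if (fdflags < 0)
		return -1;
	return fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/*
 * Hypothetical helper: open a uffd that will not survive exec.
 * Returns the fd, or -1 if the syscall is unavailable or not permitted.
 */
int uffd_open_cloexec(void)
{
#ifdef __NR_userfaultfd
	return (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
#else
	return -1;
#endif
}
```

A monitor written this way would presumably not notice options 2 or
3, and would already satisfy the requirement in option 6. For an fd
obtained some other way (for example over SCM_RIGHTS), it could call
fd_force_cloexec() before doing anything else with it.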