Hello,

On Sat, Oct 12, 2019 at 06:14:23PM -0700, Andy Lutomirski wrote:
> [adding more people because this is going to be an ABI break, sigh]

That wouldn't break the ABI, no more than if you boot a kernel built with
CONFIG_USERFAULTFD=n. All non-cooperative features can be removed at any
time in a backwards compatible way; the only precaution is to mark their
feature bits as reserved so they can't be reused for something else later.

> least severely restricted. A .read implementation MUST NOT ACT ON THE
> CALLING TASK. Ever. Just imagine the effect of passing a userfaultfd
> as stdin to a setuid program.

With UFFD_EVENT_FORK, the newly created uffd that controls the child is
not passed to the parent nor to the child. Instead it's passed only to
the CRIU monitor, which has to be already running as root, is fully
trusted and acts as a hypervisor (even though there is no hypervisor).

By the time execve runs and any suid bit in the execve'd inode becomes
relevant, well before the new userland executable code can run, the
kernel throws away the "old_mm" controlled by any uffd and all attached
uffds are released as well.

All I found is your "A .read implementation MUST NOT ACT ON THE CALLING
TASK" as an explanation that something is broken, but I need further
clarification.

Of course I can see you can always open a uffd and pass it to any task
you are going to execve on, but that simply means the suid program will
be able to control you, not the other way around. If you don't want to
be controlled by the next task, no matter if suid or not, just don't do
that.

What I don't see is how you're going to control the suid binary from the
outside: at most the suid binary will block in the poll, read and write
syscalls and read garbage, or write some garbage and get an error. It
won't get signals and it cannot block in any page fault either, so it's
not immediately clear what's out of the ordinary.

On Mon, Oct 14, 2019 at 06:04:22PM +0200, Jann Horn wrote:
> FWIW, <https://codesearch.debian.net/search?q=UFFD_FEATURE_EVENT_FORK&literal=1>
> just shows the kernel, kernel selftests, and strace code for decoding
> syscall arguments. CRIU uses it though (probably for postcopy live
> migration / lazy migration?), I guess that code isn't in debian for
> some reason.

https://criu.org/Userfaultfd#Limitations

The CRIU developers did a truly amazing job by making container postcopy
live migration work great for a subset of apps; that alone was an amazing
achievement. Is that achievement enough to use postcopy live migration of
bare metal containers in production? Unfortunately probably not, and not
just in debian.

If you're wrong and UFFD_EVENT_FORK isn't currently buggy and in turn
isn't causing further maintenance burden, there is no hurry to remove
anything, but in the long term, if none of the non-cooperative features
finds its way into production (even though it was initially reasonable
to expect they would), they must be removed from the kernel anyway, not
just UFFD_EVENT_FORK but all the non-cooperative features associated
with it. In my view the kernel is complex enough that we can't keep in
it anything that isn't actively used in production.

If you're right and UFFD_EVENT_FORK is flawed beyond repair, well then
we should remove all non-cooperative features right now. Or is someone
out there using CRIU --lazy-pages in production?

Virtual machine postcopy live migration has been shipping in production
for years, it's fully reliable and by design it cannot suffer from any
of the above limitations.
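To make the UFFD_EVENT_FORK flow above concrete, this is roughly what
the monitor side does when the process it tracks forks (a minimal
sketch, not the actual CRIU code; error handling is omitted and "uffd"
is assumed to be the parent's userfaultfd with UFFD_FEATURE_EVENT_FORK
enabled):

#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

static void handle_uffd_event(int uffd)
{
        struct uffd_msg msg;
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };

        poll(&pfd, 1, -1);
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                return; /* error handling omitted */

        if (msg.event == UFFD_EVENT_FORK) {
                /*
                 * The uffd controlling the child is installed by read()
                 * into the fd table of whoever issued the read(), i.e.
                 * the monitor; neither the parent nor the child ever
                 * sees this file descriptor.
                 */
                int child_uffd = (int)msg.arg.fork.ufd;

                /* ... start tracking the child through child_uffd ... */
                (void)child_uffd;
        }
}

The only task that ever receives the new fd is the one doing the read,
which in the CRIU case is the trusted monitor.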
In my view there's simply no justification not to use virtual machines
when the alternative requires so much more code to be written and so
much more complexity to be dealt with. However the higher complexity
already happened in lots of areas of the kernel, where things got
extremely complex just to avoid using virtual machines, even though the
end result is less secure, for the sole sake of slightly higher
consolidation (especially if you ignore the existence of millisecond
guest boot times, direct mapped pmem nvdimm, virtfs and free page
hinting).

The non-cooperative features of userfaultfd in principle aren't
fundamentally different from other cgroup vs KVM tradeoffs, so 1) it
wasn't apparent they wouldn't eventually be used in production and 2)
it didn't sound fair not to give bare metal containers a chance to
achieve feature parity with VMs (again with much higher code
complexity, but that was to be expected and it has apparently been
dealt with in other areas with more or less satisfactory results).

While at it, as far as userfaultfd is concerned, I'd rather see people
spend time writing a malloc library that uses userfaultfd with the
UFFD_FEATURE_SIGBUS feature and replaces mmap with UFFDIO_ZEROPAGE
(plus adding the THP acceleration currently missing) and munmap with
MADV_DONTNEED, to allocate and free memory with full scalability
without ever taking the mmap_sem for writing (a rough sketch of the
idea is appended after my signature). This is already possible without
any further kernel change (the THP acceleration of UFFDIO_ZEROPAGE will
only make it go faster, but it could be done later, after the lib
already works, because it'd be invisible to userland).

On my side, instead of trying to fix whatever issue there is in
UFFD_EVENT_FORK, I'd prefer to spend my time reviewing the uffd-wp
feature from Peter and the page fault enhancement patchset that Peter
and Linus were discussing.

uffd-wp has the potential to eliminate fork() from all apps that call
fork() only to take an atomic snapshot of their memory. Replacing
fork() also means the uffd manager thread can decide how much memory to
reserve for the snapshot, and it can start throttling, waiting for I/O
completion, if the threshold is exceeded, while fork() COWs cannot
throttle and all apps using fork() risk hitting 2x memory usage, which
can become oom-killer material if the memory size of the process is
huge.

A side benefit is that, the way userfaultfd works, the fault
granularity is entirely under userland control (because it's always
userland that resolves the fault): it could decide to use 8k or 16k
even if that doesn't match the hardware page size. That will allow
keeping THP on without risking 2M COWs during the snapshot. Being able
to keep THP enabled in a nosql db without hitting slow 2M COW copies
during the snapshot should allow a further overall performance
improvement, while the snapshot is not running, compared to what is
possible today.

In a completely different use case, uffd-wp will also spare JITs from
setting a dirty bit every time they modify any data in memory. It
should also be possible to provide the same soft-dirty information in
O(1) instead of O(N).

Thanks,
Andrea
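P.S. A minimal sketch of the SIGBUS-based allocation idea above, in case
it helps to visualize it (error handling and the real allocator
bookkeeping are omitted; ARENA_SIZE, arena_init, chunk_alloc and
chunk_free are made-up names for illustration, not an existing library):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARENA_SIZE (1UL << 30)          /* hypothetical arena size */

static int uffd;
static char *arena;

static void arena_init(void)
{
        /* with UFFD_FEATURE_SIGBUS a missing fault raises SIGBUS instead
           of queueing a message, so no fault handler thread is needed */
        struct uffdio_api api = { .api = UFFD_API,
                                  .features = UFFD_FEATURE_SIGBUS };
        struct uffdio_register reg;

        uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        ioctl(uffd, UFFDIO_API, &api);

        /* one big arena reserved once; mmap/munmap never run again */
        arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        reg.range.start = (unsigned long)arena;
        reg.range.len = ARENA_SIZE;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* the "mmap" replacement: back [off, off+len) with zero pages */
static void chunk_alloc(unsigned long off, unsigned long len)
{
        struct uffdio_zeropage zp = {
                .range = { .start = (unsigned long)arena + off, .len = len },
        };

        ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
}

/* the "munmap" replacement: drop the pages; any stray access SIGBUSes
   until the range is handed out again by another chunk_alloc() */
static void chunk_free(unsigned long off, unsigned long len)
{
        madvise(arena + off, len, MADV_DONTNEED);
}

After arena_init() the allocator never calls mmap or munmap again, so
the allocation/free fast paths never take the mmap_sem for writing, and
an access to freed memory gets SIGBUS instead of silently reading
zeroes.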