On Fri, Feb 03, 2023 at 03:09:55PM +0900, David Stevens wrote:
> > > > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > > > is inherently brittle. It's possible for userspace to bypass
> > > > userfaultfd on a shmem mapping by accessing the shmem through a
> > > > different mapping or simply by using the write syscall.
> >
> > Yes this is possible, but this is user-visible operation - no matter it was
> > a read()/write() from another process, or mmap()ed memory accesses.
> > Khugepaged merges ptes in a way that is out of control of users. That's
> > something the user can hardly control.
> >
> > AFAICT currently file-based uffd missing mode all works in that way. IOW
> > the user should have full control of the file/inode under the hood to make
> > sure there will be nothing surprising. Otherwise I don't really see how
> > the missing mode can work solidly since it's page cache based.
> >
> > > > It might be sufficient to say that the kernel won't directly bypass a
> > > > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > > > the other hand, I guess it's not great for the presence of an unused
> > > > shmem mapping lying around to cause khugepaged to have user-visible
> > > > side effects.
> >
> > Maybe it works for your use case already, for example, if in your app the
> > shmem is only and always be mapped once? However that doesn't seem like a
> > complete solution to me.
>
> We're using userfaultfd for guest memory for a VM. We do have
> sandboxed device processes. However, thinking about it a bit more,
> this approach would probably cause issues with device hotplug.
>
> > There's nothing that will prevent another mapping being established, and
> > right after that happens it'll stop working, because khugepaged can notice
> > that new mm/vma which doesn't register with uffd at all, and thinks it a
> > good idea to collapse the shmem page cache again. Uffd will silently fail
> > in another case even if not immediately in your current app/reproducer.
> >
> > Again, I don't think what I propose above is anything close to good.. It'll
> > literally disable any collapsing possibility for a shmem node as long as
> > any small portion of the inode mapping address space got registered by any
> > process with uffd. I just don't see any easier approach so far.
>
> Maybe we can make things easier by being more precise about what bug
> we're trying to fix. Strictly speaking, I don't think what we're
> concerned about is whether or not userfaultfd is registered on a
> particular VMA at a particular point in time. I think what we're actually
> concerned about is that when userspace has a page with an armed
> userfaultfd that it knows is missing, that page should not be filled by
> khugepaged. If userspace doesn't know that a userfaultfd armed page is
> missing, then even if khugepaged fills that page, as far as userspace is
> concerned, the page was filled by khugepaged before userfaultfd was
> armed.

IMHO that's a common issue to solve for uffd missing mode registrations,
and the common solution I'm aware of is, for shmem, to punch holes in the
ranges where we hope to receive a message.  That should only happen
_after_ the missing mode registration has succeeded.  At least that's
what we do in QEMU's postcopy.
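For reference, a minimal userspace sketch of that register-then-punch
ordering could look roughly like below.  The names (shmem_fd, area, len)
are made up for illustration, error handling is dropped, and this is not
lifted from any real code base:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: "area" is assumed to be a page-aligned MAP_SHARED
 * mapping of shmem_fd starting at file offset 0, and "len" a page-aligned
 * length.  All error checking is omitted to keep the sketch short.
 */
static void arm_missing_then_punch(int shmem_fd, void *area, size_t len)
{
        long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)area, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /*
         * Only once MISSING registration has succeeded, drop the pages we
         * want to be notified about; a later fault on this range should
         * then raise a UFFD_EVENT_PAGEFAULT message instead of silently
         * filling the hole.
         */
        fallocate(shmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, len);
}

The only point of the sketch is the ordering: the hole punch comes after
UFFDIO_REGISTER has succeeded, so there is no window where a page can be
filled without the monitor ever seeing a message for it.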
> If that's a valid way to look at it, then I think the fact that
> collapse_file locks hpage provides most of the necessary locking.

The hpage is still not visible to the page cache at all, not until it is
used to replace the small pages, right?  Or do you mean "when holding the
page lock of the existing small pages"?

> From there, we need to check whether there are any VMAs with armed
> userfaultfds that might have observed a missing page. I think that can be
> done while iterating over VMAs in retract_page_tables without acquiring

I am not 100% sure how this works.  AFAICT we only retract page tables
after we have already collapsed the page, so I don't see how it could be
undone at that point.

> any mmap_lock by adding some memory barriers to userfaultfd_set_vm_flags
> and userfaultfd_armed. It is possible that a userfaultfd gets registered
> on a particular VMA after we check its flags but before the collapse
> finishes. I think the only observability hole left would be operations on
> the shmem file descriptor that don't actually lock pages
> (e.g. SEEK_DATA/SEEK_HOLE), which are hopefully solvable with some more
> thought.

So it's possible that I just didn't grasp the fundamental idea of how the
new proposal works.  If you're confident in the idea, I'd also be glad to
read the patch directly; maybe that'll help explain things.

Thanks,

-- 
Peter Xu