> > > I don't know if it's necessary to go that far. Userfaultfd plus shmem
> > > is inherently brittle. It's possible for userspace to bypass
> > > userfaultfd on a shmem mapping by accessing the shmem through a
> > > different mapping or simply by using the write syscall.
>
> Yes this is possible, but this is user-visible operation - no matter it was
> a read()/write() from another process, or mmap()ed memory accesses.
> Khugepaged merges ptes in a way that is out of control of users. That's
> something the user can hardly control.
>
> AFAICT currently file-based uffd missing mode all works in that way. IOW
> the user should have full control of the file/inode under the hood to make
> sure there will be nothing surprising. Otherwise I don't really see how
> the missing mode can work solidly since it's page cache based.
>
> > > It might be sufficient to say that the kernel won't directly bypass a
> > > VMA's userfaultfd to collapse the underlying shmem's pages. Although on
> > > the other hand, I guess it's not great for the presence of an unused
> > > shmem mapping lying around to cause khugepaged to have user-visible
> > > side effects.
>
> Maybe it works for your use case already, for example, if in your app the
> shmem is only and always be mapped once? However that doesn't seem like a
> complete solution to me.

We're using userfaultfd for guest memory for a VM. We do have sandboxed
device processes. However, thinking about it a bit more, this approach
would probably cause issues with device hotplug.

> There's nothing that will prevent another mapping being established, and
> right after that happens it'll stop working, because khugepaged can notice
> that new mm/vma which doesn't register with uffd at all, and thinks it a
> good idea to collapse the shmem page cache again. Uffd will silently fail
> in another case even if not immediately in your current app/reproducer.
>
> Again, I don't think what I propose above is anything close to good.. It'll
> literally disable any collapsing possibility for a shmem node as long as
> any small portion of the inode mapping address space got registered by any
> process with uffd. I just don't see any easier approach so far.

Maybe we can make things easier by being more precise about what bug
we're trying to fix. Strictly speaking, I don't think what we're
concerned about is whether or not userfaultfd is registered on a
particular VMA at a particular point in time. I think what we're
actually concerned about is that when userspace has a page with an
armed userfaultfd that it knows is missing, that page should not be
filled by khugepaged. If userspace doesn't know that a userfaultfd
armed page is missing, then even if khugepaged fills that page, as far
as userspace is concerned, the page was filled by khugepaged before
userfaultfd was armed.

If that's a valid way to look at it, then I think the fact that
collapse_file locks hpage provides most of the necessary locking. From
there, we need to check whether there are any VMAs with armed
userfaultfds that might have observed a missing page. I think that can
be done while iterating over VMAs in retract_page_tables without
acquiring any mmap_lock by adding some memory barriers to
userfaultfd_set_vm_flags and userfaultfd_armed. It is possible that a
userfaultfd gets registered on a particular VMA after we check its
flags but before the collapse finishes. I think the only observability
hole left would be operations on the shmem file descriptor that don't
actually lock pages (e.g. SEEK_DATA/SEEK_HOLE), which are hopefully
solvable with some more thought.

-David

> Thanks,
>
> --
> Peter Xu
>
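
P.S. To make the memory barrier idea above a bit more concrete, here is a
rough userspace model of the ordering I have in mind, written with C11
atomics. This is only a sketch under my own assumptions, not kernel code:
model_vma, model_range, model_uffd_register, model_try_collapse and
model_demo are all made-up names, and the seq_cst fences merely stand in
for whatever barriers (presumably something like smp_mb()) would end up in
userfaultfd_set_vm_flags and userfaultfd_armed. Each side publishes its
own flag, issues a full fence, and only then reads the other side's flag,
so at least one of the two is guaranteed to see the other.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_vma {
	atomic_bool uffd_armed;	/* stands in for the VMA's uffd flags */
};

struct model_range {
	atomic_bool collapsing;	/* stands in for "hpage locked, collapse in flight" */
};

/*
 * Registration side: publish the armed flag, full fence, then look at
 * the range. Returns true if a collapse may be in flight, in which case
 * userspace cannot yet assume that a page it has not seen is missing.
 */
bool model_uffd_register(struct model_vma *vma, struct model_range *range)
{
	atomic_store_explicit(&vma->uffd_armed, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&range->collapsing, memory_order_relaxed);
}

/*
 * Collapse side: publish the in-flight marker, full fence, then check
 * the VMA. Returns true if the collapse may go ahead, false if an armed
 * VMA was observed and the collapse backs off.
 */
bool model_try_collapse(struct model_range *range, struct model_vma *vma)
{
	atomic_store_explicit(&range->collapsing, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&vma->uffd_armed, memory_order_relaxed)) {
		/* An armed VMA might have observed a missing page: undo. */
		atomic_store_explicit(&range->collapsing, false,
				      memory_order_relaxed);
		return false;
	}
	return true;
}

/* Single-threaded walk through the two possible orderings. */
void model_demo(void)
{
	struct model_vma vma_a, vma_b;
	struct model_range range_a, range_b;

	/* Collapse starts first: it proceeds, registration notices it. */
	atomic_init(&vma_a.uffd_armed, false);
	atomic_init(&range_a.collapsing, false);
	assert(model_try_collapse(&range_a, &vma_a));
	assert(model_uffd_register(&vma_a, &range_a));

	/* Registration happens first: the collapse sees it and backs off. */
	atomic_init(&vma_b.uffd_armed, false);
	atomic_init(&range_b.collapsing, false);
	assert(!model_uffd_register(&vma_b, &range_b));
	assert(!model_try_collapse(&range_b, &vma_b));
}
```

The model leaves out the page lock entirely. In the real thing I'd expect
the locked hpage to play the role of the collapsing marker, so a
registration that races with the collapse would end up waiting on that
lock rather than reading a flag.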