Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

Ackerley Tng <ackerleytng@xxxxxxxxxx> · Fri, 12 Jul 2024 23:29:37 +0000

Here’s an update from the Linux MM Alignment Session on July 10 2024, 9-10am
PDT:

The current direction is:

+ Allow mmap() of ranges that cover both shared and private memory, but disallow
  faulting in of private pages
  + On access to private pages, userspace will get some error, perhaps SIGBUS
  + On shared to private conversions, unmap the page and decrease refcounts

+ To support huge pages, guest_memfd will take ownership of the hugepages, and
  provide interested parties (userspace, KVM, iommu) with pages to be used.
  + guest_memfd will track usage of (sub)pages, for both private and shared
    memory
  + Pages will be broken into smaller (probably 4K) chunks at creation time to
    simplify implementation (as opposed to splitting at runtime when private to
    shared conversion is requested by the guest)
    + Core MM infrastructure will still be used to track page table mappings in
      mapcounts and other references (refcounts) per subpage
    + HugeTLB vmemmap Optimization (HVO) is lost when pages are broken up - to
      be optimized later. Suggestions:
      + Use a tracking data structure other than struct page
      + Remove the memory for struct pages backing private memory from the
        vmemmap, and re-populate the vmemmap on conversion from private to
        shared
  + Implementation pointers for huge page support
    + Consensus was that getting core MM to do tracking seems wrong
    + Maintaining special page refcounts for guest_memfd pages is difficult to
      get working and requires weird special casing in many places. This was
      tried for FS DAX pages and did not work out: [1]

+ Implementation suggestion: use infrastructure similar to what ZONE_DEVICE
  uses, to provide the huge page to interested parties
  + TBD: how to actually get huge pages into guest_memfd
  + TBD: how to provide/convert the huge pages to ZONE_DEVICE
    + Perhaps reserve them at boot time like in HugeTLB

+ Line of sight to compaction/migration:
  + Compaction here means making memory contiguous
  + Compaction/migration scope:
    + In scope for 4K pages
    + Out of scope for 1G pages and anything managed through ZONE_DEVICE
    + Out of scope for an initial implementation
  + Ideas for future implementations
    + Reuse the non-LRU page migration framework as used by memory balloning
    + Have userspace drive compaction/migration via ioctls
      + Having line of sight to optimizing lost HVO means avoiding being locked
        in to any implementation requiring struct pages
        + Without struct pages, it is hard to reuse core MM’s
          compaction/migration infrastructure

+ Discuss more details at LPC in Sep 2024, such as how to use huge pages,
  shared/private conversion, huge page splitting

This addresses the prerequisites set out by Fuad and Elliott at the beginning of
the session, which were:

1. Non-destructive shared/private conversion
  + Through having guest_memfd manage and track both shared/private memory
2. Huge page support with the option of converting individual subpages
  + Splitting of pages will be managed by guest_memfd
3. Line of sight to compaction/migration of private memory
  + Possibly driven by userspace using guest_memfd ioctls
4. Loading binaries into guest (private) memory before VM starts
  + This was identified as a special case of (1.) above
5. Non-protected guests in pKVM
  + Not discussed during session, but this is a goal of guest_memfd, for all VM
    types [2]

David Hildenbrand summarized this during the meeting at t=47m25s [3].

[1]: https://lore.kernel.org/linux-mm/cover.66009f59a7fe77320d413011386c3ae5c2ee82eb.1719386613.git-series.apopple@xxxxxxxxxx/
[2]: https://lore.kernel.org/lkml/ZnRMn1ObU8TFrms3@xxxxxxxxxx/
[3]: https://drive.google.com/file/d/17lruFrde2XWs6B1jaTrAy9gjv08FnJ45/view?t=47m25s&resourcekey=0-LiteoxLd5f4fKoPRMjMTOw