Re: folio_mmapped

David Hildenbrand <david@xxxxxxxxxx> · Mon, 18 Mar 2024 23:02:17 +0100

On 18.03.24 18:06, Vishal Annapurve wrote:
On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 04.03.24 20:04, Sean Christopherson wrote:
On Mon, Mar 04, 2024, Quentin Perret wrote:
As discussed in the sub-thread, that might still be required.

One could think about completely forbidding GUP on these mmap'ed
guest-memfds. But likely, there might be use cases in the future where you
want to use GUP on shared memory inside a guest_memfd.

(the iouring example I gave might currently not work because
FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
details)

Perhaps it would be wise to start with GUP being forbidden if the
current users do not need it (not sure if that is the case in Android,
I'll check) ? We can always relax this constraint later when/if the
use-cases arise, which is obviously much harder to do the other way
around.

+1000.  At least on the KVM side, I would like to be as conservative as possible
when it comes to letting anything other than the guest access guest_memfd.

So we'll have to do it similar to any occurrences of "secretmem" in
gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
similar to e.g., folio_is_secretmem().

IIRC, we might not be able to de-reference the actual mapping because it
could get freed concurrently ...

That will then prohibit any kind of GUP access to these pages, including
reading/writing for ptrace/debugging purposes, for core dumping purposes
etc. But at least, you know that nobody was able to obtain page
references using GUP that might be used for reading/writing later.

There has been little discussion about supporting 1G pages with
guest_memfd for TDX/SNP or pKVM. I would like to restart this
discussion [1]. 1G pages should be a very important usecase for guest
memfd, especially considering large VM sizes supporting confidential
GPU/TPU workloads.

Using separate backing stores for private and shared memory ranges is
not going to work effectively when using 1G pages. Consider the
following scenario of memory conversion when using 1G pages to back
private memory:
* Guest requests conversion of a 4KB range from private to shared; the
host in response ideally does the following steps:
     a) Updates the guest memory attributes
     b) Unbacks the corresponding private memory
     c) Allocates the corresponding shared memory, or lets it be
faulted in when the guest accesses it

Step b above can't be skipped here, otherwise we would have two
physical pages (1 backing private memory, another backing the shared
memory) for the same GPA range causing "double allocation".
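The steps above, and the "double allocation" that results when step b is skipped, can be modelled with a small userspace sketch. Everything here (struct gpa_page, convert_to_shared(), backing_pages()) is a hypothetical illustration and not a KVM or guest_memfd interface; it only counts how many physical pages end up backing a single 4KB GPA page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum gpa_attr { GPA_PRIVATE, GPA_SHARED };

struct gpa_page {
    enum gpa_attr attr;   /* guest memory attribute for this GPA page */
    bool private_backed;  /* a physical page exists in the private store */
    bool shared_backed;   /* a physical page exists in the shared store */
};

/* Convert one 4KB GPA page from private to shared, following steps a) to c).
 * skip_unback models a host that omits step b). */
static void convert_to_shared(struct gpa_page *p, bool skip_unback)
{
    assert(p != NULL);
    p->attr = GPA_SHARED;              /* a) update the memory attributes */
    if (!skip_unback)
        p->private_backed = false;     /* b) unback the private memory */
    p->shared_backed = true;           /* c) allocate or fault in shared memory */
}

/* Number of physical pages currently backing this one GPA page. */
static int backing_pages(const struct gpa_page *p)
{
    return (p->private_backed ? 1 : 0) + (p->shared_backed ? 1 : 0);
}
```

With separate backing stores, a page that starts out private and backed ends with one backing page if step b runs, and with two if it is skipped.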

With 1G pages, it would be difficult to punch a KB- or even MB-sized
hole, since supporting that would require splitting the 1G page (which
hugetlbfs doesn't support today, for good reasons), causing:
         - loss of vmemmap optimization [3]
         - losing ability to reconstitute the huge page again,
especially as private pages in CVMs are not relocatable today,
increasing overall fragmentation over time.
               - unless a smarter algorithm is devised for memory
reclaim to reconstitute large pages for unmovable memory.
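To put rough numbers on the split: a minimal arithmetic sketch, assuming 4KB base pages and a 64-byte struct page (typical on x86-64, but an assumption here and not stated in this thread). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (UINT64_C(4) << 10)
#define SZ_1G (UINT64_C(1) << 30)

/* Assumption: 64-byte struct page, as is typical on x86-64. */
#define ASSUMED_STRUCT_PAGE_SIZE UINT64_C(64)

/* Number of 4KB base pages that make up a huge page of the given size. */
static uint64_t base_pages_in(uint64_t huge_size)
{
    assert(huge_size % SZ_4K == 0);
    return huge_size / SZ_4K;
}

/* Bytes of struct page metadata needed when every base page of the huge
 * page must be described individually, i.e. without vmemmap optimization. */
static uint64_t full_vmemmap_bytes(uint64_t huge_size)
{
    return base_pages_in(huge_size) * ASSUMED_STRUCT_PAGE_SIZE;
}
```

Under these assumptions a 1G page consists of 262144 base pages and needs 16 MiB of struct page metadata once it has to be tracked at 4KB granularity, so a single 4KB hole touches 1/262144 of the page while giving up the optimization for all of it.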

With the above limitations in place, best thing could be to allow:
  - single backing store for both shared and private memory ranges
  - host userspace to mmap the guest memfd (as this series is trying to do)
  - allow userspace to fault in memfd file ranges that correspond to
shared GPA ranges
      - pagetable mappings will need to be restricted to shared memory
ranges causing higher granularity mappings (somewhat similar to what
HGM series from James [2] was trying to do) than 1G.
  - Allow IOMMU also to map those pages (pfns would be requested using
get_user_pages* APIs) to allow devices to access shared memory. IOMMU
management code would have to be enlightened or somehow restricted to
map only shared regions of guest memfd.
  - Upon conversion from shared to private, host will have to ensure
that there are no mappings/references present for the memory ranges
being converted to private.

If the above usecase sounds reasonable, GUP access to guest memfd
pages should be allowed.

To say it with nice words: "Not a fan".

First, I don't think only 1 GiB will be problematic. Already 2 MiB ones 
will be problematic and so far it is even unclear how guest_memfd will 
consume them in a way acceptable to upstream MM. Likely not using 
hugetlb from what I recall after the previous discussions with Mike.

Second, we should find better ways to let an IOMMU map these pages, 
*not* using GUP. There were already discussions on providing a similar 
fd+offset-style interface instead. GUP really sounds like the wrong 
approach here. Maybe we should look into passing not only guest_memfd, 
but also "ordinary" memfds.

Third, I don't think we should be using huge pages where huge pages 
don't make any sense. Using a 1 GiB page so the VM will convert some 
pieces to map it using PTEs will destroy the whole purpose of using 1 
GiB pages. It doesn't make any sense.

A direction that might make sense is either (A) enlightening the VM about 
the granularity in which memory can be converted (but also problematic 
for 1 GiB pages) and/or (B) physically restricting the memory that can 
be converted.

For example, one could create a GPA layout where some regions are backed 
by gigantic pages that cannot be converted/can only be converted as a 
whole, and some are backed by 4k pages that can be converted back and 
forth. We'd use multiple guest_memfds for that. I recall that physically 
restricting such conversions/locations (e.g., for bounce buffers) in 
Linux was already discussed somewhere, but I don't recall the details.
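As a rough sketch of what such a restricted layout could look like, the following models a GPA map in which each region carries its own conversion granularity. struct gpa_region and conversion_allowed() are hypothetical names for illustration only; nothing like this exists in KVM or guest_memfd today, and the granularities used are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout description; not an existing KVM or guest_memfd
 * interface. One region would correspond to one guest_memfd. */
struct gpa_region {
    uint64_t start;        /* first GPA of the region */
    uint64_t size;         /* region size in bytes */
    uint64_t conv_granule; /* smallest convertible unit; 0 = not convertible */
};

/* Return true if [gpa, gpa + len) may be converted between private and
 * shared: it must lie within one region and match that region's granule. */
static bool conversion_allowed(const struct gpa_region *regions, size_t n,
                               uint64_t gpa, uint64_t len)
{
    assert(regions != NULL);
    if (len == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        const struct gpa_region *r = &regions[i];
        if (gpa < r->start || gpa - r->start >= r->size)
            continue;                       /* gpa is not in this region */
        uint64_t off = gpa - r->start;
        if (len > r->size - off)
            return false;                   /* range crosses the region end */
        if (r->conv_granule == 0)
            return false;                   /* region can never be converted */
        return off % r->conv_granule == 0 && len % r->conv_granule == 0;
    }
    return false;                           /* gpa not covered by any region */
}
```

A region backed by gigantic pages would set conv_granule to 1 GiB (or to 0), while a 4k-backed region, for example one reserved for bounce buffers, would set it to 4 KiB, so a 4k conversion request is only accepted in the latter.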

It's all not trivial and not easy to get "clean".

Concluding that individual pieces of a 1 GiB / 2 MiB huge page should 
not be converted back and forth might be reasonable. Although I'm sure 
people will argue the opposite and develop hackish solutions in 
desperate ways to make it work somehow.

Huge pages, and especially gigantic pages, are simply a bad fit if the 
VM will convert individual 4k pages.

But to answer your last question: we might be able to avoid GUP by using 
a different mapping API, similar to the ones KVM now provides.

--
Cheers,

David / dhildenb