Re: folio_mmapped

Vishal Annapurve <vannapurve@xxxxxxxxxx> · Mon, 18 Mar 2024 10:06:11 -0700

On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.03.24 20:04, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>> As discussed in the sub-thread, that might still be required.
> >>>
> >>> One could think about completely forbidding GUP on these mmap'ed
> >>> guest-memfds. But likely, there might be use cases in the future where you
> >>> want to use GUP on shared memory inside a guest_memfd.
> >>>
> >>> (the iouring example I gave might currently not work because
> >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> >>> details)
> >>
> >> Perhaps it would be wise to start with GUP being forbidden if the
> >> current users do not need it (not sure if that is the case in Android,
> >> I'll check) ? We can always relax this constraint later when/if the
> >> use-cases arise, which is obviously much harder to do the other way
> >> around.
> >
> > +1000.  At least on the KVM side, I would like to be as conservative as possible
> > when it comes to letting anything other than the guest access guest_memfd.
>
> So we'll have to do it similar to any occurrences of "secretmem" in
> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> similar to e.g., folio_is_secretmem().
>
> IIRC, we might not be able to de-reference the actual mapping because it
> could get freed concurrently ...
>
> That will then prohibit any kind of GUP access to these pages, including
> reading/writing for ptrace/debugging purposes, for core dumping purposes
> etc. But at least, you know that nobody was able to obtain page
> references using GUP that might be used for reading/writing later.
>

There has been little discussion about supporting 1G pages with
guest_memfd for TDX/SNP or pKVM. I would like to restart this
discussion [1]. 1G pages should be a very important use case for
guest_memfd, especially considering large VM sizes supporting
confidential GPU/TPU workloads.

Using separate backing stores for private and shared memory ranges is
not going to work effectively when using 1G pages. Consider the
following memory conversion scenario when 1G pages back private
memory:
* Guest requests conversion of a 4KB range from private to shared. In
  response, the host ideally does the following:
    a) Updates the guest memory attributes
    b) Unbacks the corresponding private memory
    c) Allocates the corresponding shared memory, or lets it be
       faulted in when the guest accesses it

Step b above can't be skipped here, otherwise we would have two
physical pages (1 backing private memory, another backing the shared
memory) for the same GPA range causing "double allocation".
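
To make the double allocation concrete, here is a toy userspace model of
steps a) to c) for a single 4KB page. It is only a sketch: struct gpa_page,
convert_to_shared() and backing_pages() are made-up names for illustration,
not kernel or KVM interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-4KB-page state for one GPA; not a kernel structure. */
struct gpa_page {
	bool attr_private;   /* guest memory attribute */
	bool private_backed; /* page present in the private backing store */
	bool shared_backed;  /* page present in the shared backing store */
};

/* Convert one page from private to shared; skip_unback models omitting step b. */
void convert_to_shared(struct gpa_page *p, bool skip_unback)
{
	p->attr_private = false;           /* a) update memory attributes */
	if (!skip_unback)
		p->private_backed = false; /* b) unback private memory */
	p->shared_backed = true;           /* c) allocate / fault in shared memory */
}

/* Number of physical pages currently backing this GPA. */
int backing_pages(const struct gpa_page *p)
{
	return (int)p->private_backed + (int)p->shared_backed;
}
```

Starting from a page that is private and privately backed, the full
sequence leaves one backing page, while skipping step b leaves two.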

With 1G pages, it would be difficult to punch KB- or even MB-sized
holes, since supporting that would require splitting the 1G page
(which hugetlbfs doesn't support today, for good reasons), causing:
    - loss of the vmemmap optimization [3]
    - loss of the ability to reconstitute the huge page, especially
      as private pages in CVMs are not relocatable today, increasing
      overall fragmentation over time
        - unless a smarter memory reclaim algorithm is devised to
          reconstitute large pages for unmovable memory
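
To put a rough number on the vmemmap point, the arithmetic below computes
how much struct page metadata describes one 1G page when nothing is
deduplicated. It assumes 4KB base pages and a 64-byte struct page; both
are assumptions for illustration, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE   4096ULL      /* assumed base page size */
#define HUGE_PAGE_SIZE   (1ULL << 30) /* 1G huge page */
#define STRUCT_PAGE_SIZE 64ULL        /* assumed sizeof(struct page) */

/* Bytes of struct page metadata describing one huge page, no deduplication. */
uint64_t vmemmap_bytes(uint64_t huge_size)
{
	return huge_size / BASE_PAGE_SIZE * STRUCT_PAGE_SIZE;
}

/* Number of base pages that this metadata occupies. */
uint64_t vmemmap_pages(uint64_t huge_size)
{
	return vmemmap_bytes(huge_size) / BASE_PAGE_SIZE;
}
```

Under those assumptions, a 1G page is described by 262144 struct pages,
i.e. 16MB of metadata spread over 4096 base pages. Most of that is what
the optimization in [3] avoids keeping while the page stays whole, and
what would have to be populated again if the page were split.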

With the above limitations in place, the best option could be to allow:
 - a single backing store for both shared and private memory ranges
 - host userspace to mmap the guest memfd (as this series is trying to do)
 - userspace to fault in memfd file ranges that correspond to shared
   GPA ranges
     - pagetable mappings will need to be restricted to shared memory
       ranges, resulting in mappings at a finer granularity than 1G
       (somewhat similar to what the HGM series from James [2] was
       trying to do)
 - the IOMMU to map those pages as well (pfns would be requested using
   get_user_pages* APIs) so that devices can access shared memory.
   IOMMU management code would have to be enlightened or somehow
   restricted to map only shared regions of guest memfd.
 - Upon conversion from shared to private, the host will have to ensure
   that there are no mappings/references present for the memory ranges
   being converted to private.
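
The last requirement can be sketched as a toy userspace model. This is
not how guest_memfd tracks state: struct gmem_range, its fields and
convert_to_private() are hypothetical names, and returning -EBUSY is
just one possible policy for a range that is still in use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one guest_memfd range; not a kernel structure. */
struct gmem_range {
	bool is_private;
	int  user_mappings; /* host userspace page table mappings */
	int  gup_pins;      /* references taken via get_user_pages*(), e.g. for IOMMU */
};

/* Refuse shared -> private conversion while the host can still reach the range. */
int convert_to_private(struct gmem_range *r)
{
	if (r->user_mappings > 0 || r->gup_pins > 0)
		return -EBUSY;
	r->is_private = true;
	return 0;
}
```

In this model the caller would first have to unmap the range from
userspace and drop any pins (including IOMMU mappings) before retrying
the conversion.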

If the above use case sounds reasonable, GUP access to guest memfd
pages should be allowed.

[1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@xxxxxxxxxx/
[3] https://docs.kernel.org/mm/vmemmap_dedup.html

> --
> Cheers,
>
> David / dhildenb
>