Re: folio_mmapped

Vishal Annapurve <vannapurve@xxxxxxxxxx> · Mon, 18 Mar 2024 16:07:16 -0700

On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 18.03.24 18:06, Vishal Annapurve wrote:
> > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 04.03.24 20:04, Sean Christopherson wrote:
> >>> On Mon, Mar 04, 2024, Quentin Perret wrote:
> >>>>> As discussed in the sub-thread, that might still be required.
> >>>>>
> >>>>> One could think about completely forbidding GUP on these mmap'ed
> >>>>> guest-memfds. But likely, there might be use cases in the future where you
> >>>>> want to use GUP on shared memory inside a guest_memfd.
> >>>>>
> >>>>> (the iouring example I gave might currently not work because
> >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and
> >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some
> >>>>> details)
> >>>>
> >>>> Perhaps it would be wise to start with GUP being forbidden if the
> >>>> current users do not need it (not sure if that is the case in Android,
> >>>> I'll check) ? We can always relax this constraint later when/if the
> >>>> use-cases arise, which is obviously much harder to do the other way
> >>>> around.
> >>>
> >>> +1000.  At least on the KVM side, I would like to be as conservative as possible
> >>> when it comes to letting anything other than the guest access guest_memfd.
> >>
> >> So we'll have to do it similar to any occurrences of "secretmem" in
> >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code
> >> similar to e.g., folio_is_secretmem().
> >>
> >> IIRC, we might not be able to de-reference the actual mapping because it
> >> could get freed concurrently ...
> >>
> >> That will then prohibit any kind of GUP access to these pages, including
> >> reading/writing for ptrace/debugging purposes, for core dumping purposes
> >> etc. But at least, you know that nobody was able to obtain page
> >> references using GUP that might be used for reading/writing later.
> >>
> >
> > There has been little discussion about supporting 1G pages with
> > guest_memfd for TDX/SNP or pKVM. I would like to restart this
> > discussion [1]. 1G pages should be a very important usecase for guest
> > memfd, especially considering large VM sizes supporting confidential
> > GPU/TPU workloads.
> >
> > Using separate backing stores for private and shared memory ranges is
> > not going to work effectively when using 1G pages. Consider the
> > following scenario of memory conversion when using 1G pages to back
> > private memory:
> > * Guest requests conversion of 4KB range from private to shared, host
> > in response ideally does following steps:
> >      a) Updates the guest memory attributes
> >      b) Unbacks the corresponding private memory
> >      c) Allocates corresponding shared memory or let it be faulted in
> > when guest accesses it
> >
> > Step b above can't be skipped here, otherwise we would have two
> > physical pages (1 backing private memory, another backing the shared
> > memory) for the same GPA range causing "double allocation".
> >
> > With 1G pages, it would be difficult to punch KBs or even MBs sized
> > hole since to support that:
> > 1G page would need to be split (which hugetlbfs doesn't support today
> > for good reasons), causing -
> >          - loss of vmemmap optimization [3]
> >          - losing ability to reconstitute the huge page again,
> > especially as private pages in CVMs are not relocatable today,
> > increasing overall fragmentation over time.
> >                - unless a smarter algorithm is devised for memory
> > reclaim to reconstitute large pages for unmovable memory.
> >
> > With the above limitations in place, best thing could be to allow:
> >   - single backing store for both shared and private memory ranges
> >   - host userspace to mmap the guest memfd (as this series is trying to do)
> >   - allow userspace to fault in memfd file ranges that correspond to
> > shared GPA ranges
> >       - pagetable mappings will need to be restricted to shared memory
> > ranges causing higher granularity mappings (somewhat similar to what
> > HGM series from James [2] was trying to do) than 1G.
> >   - Allow IOMMU also to map those pages (pfns would be requested using
> > get_user_pages* APIs) to allow devices to access shared memory. IOMMU
> > management code would have to be enlightened or somehow restricted to
> > map only shared regions of guest memfd.
> >   - Upon conversion from shared to private, host will have to ensure
> > that there are no mappings/references present for the memory ranges
> > being converted to private.
> >
> > If the above usecase sounds reasonable, GUP access to guest memfd
> > pages should be allowed.
>
> To say it with nice words: "Not a fan".
>
> First, I don't think only 1 GiB will be problematic. Already 2 MiB ones
> will be problematic and so far it is even unclear how guest_memfd will
> consume them in a way acceptable to upstream MM. Likely not using
> hugetlb from what I recall after the previous discussions with Mike.
>

Agreed, support for 1G pages with guest memfd is yet to be figured
out, but it remains a scenario to be considered here.
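
To make the "double allocation" concern from the quoted scenario
concrete, here is a toy Python model (not kernel code). The class and
method names are hypothetical; it only assumes two separate backing
stores keyed by 4 KiB GPA page and the steps a) to c) quoted above.

```python
SZ_4K = 4096  # base page size in bytes


class TwoStoreVM:
    """Toy model: separate private and shared backing stores per GPA page."""

    def __init__(self):
        self.private = set()      # GPA pages backed in the private store
        self.shared = set()       # GPA pages backed in the shared store
        self.attr_shared = set()  # GPA pages whose attribute is "shared"

    def convert_to_shared(self, gpa, unback_private=True):
        page = gpa // SZ_4K
        self.attr_shared.add(page)      # step a: update memory attributes
        if unback_private:
            self.private.discard(page)  # step b: unback the private memory
        self.shared.add(page)           # step c: allocate shared memory

    def double_allocated(self):
        """GPA pages that currently hold two physical pages."""
        return self.private & self.shared
```

Skipping step b (unback_private=False) leaves the converted page in
both sets, which is the double allocation described above.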

> Second, we should find better ways to let an IOMMU map these pages,
> *not* using GUP. There were already discussions on providing a similar
> fd+offset-style interface instead. GUP really sounds like the wrong
> approach here. Maybe we should look into passing not only guest_memfd,
> but also "ordinary" memfds.

I need to dig into past discussions around this, but agree that
passing guest memfds to VFIO drivers in addition to HVAs seems worth
exploring. This may be required anyways for devices supporting TDX
connect [1].

If we are talking about the same file catering to both private and
shared memory, there has to be some way to keep track of references on
the shared memory from both host userspace and IOMMU.
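
As a rough illustration of what such tracking could look like, below
is a toy Python model (not kernel code, and not a proposed interface).
SharedRangeTracker and the holder names "userspace" and "iommu" are
made up for this sketch; it assumes references are counted per 4 KiB
page of the file and that a range may only become private once every
count in it has dropped to zero.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes


def _pages(offset, length):
    """Page indices touched by the byte range [offset, offset + length)."""
    first = offset // PAGE_SIZE
    last = (offset + length + PAGE_SIZE - 1) // PAGE_SIZE
    return range(first, last)


class SharedRangeTracker:
    """Toy per-page reference counts, split by who holds the reference."""

    def __init__(self):
        # page index -> {holder name -> reference count}
        self._refs = defaultdict(lambda: defaultdict(int))

    def get(self, holder, offset, length):
        for page in _pages(offset, length):
            self._refs[page][holder] += 1

    def put(self, holder, offset, length):
        for page in _pages(offset, length):
            if self._refs[page][holder] <= 0:
                raise ValueError(f"{holder} holds no reference on page {page}")
            self._refs[page][holder] -= 1

    def can_convert_to_private(self, offset, length):
        """True only if nobody holds a reference anywhere in the range."""
        return all(
            sum(self._refs[page].values()) == 0
            for page in _pages(offset, length)
        )
```

With this model, a conversion request for a range is refused as long
as either host userspace or the IOMMU still holds a reference on any
page in it.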

>
> Third, I don't think we should be using huge pages where huge pages
> don't make any sense. Using a 1 GiB page so the VM will convert some
> pieces to map it using PTEs will destroy the whole purpose of using 1
> GiB pages. It doesn't make any sense.

I started a discussion on this [2] with an RFC series. The main
challenges here remain:
1) Unifying all the conversions under one layer
2) Ensuring shared memory allocations are huge page aligned at boot
time and runtime.

Any unified shared memory allocator (today this role is played by
SWIOTLB) will need to support huge-page-aligned dynamic increments,
which can only be guaranteed by carving out enough memory at boot time
for a CMA area and allocating from that CMA area at runtime.
   - Since it's hard to come up with a maximum amount of shared memory
needed by a VM, especially with GPUs/TPUs around, it's difficult to
pick a CMA area size at boot time.
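
The sizing problem can be shown with a toy Python model (not kernel
code and not the CMA API). ReservedPool and grow() are hypothetical
names; the sketch assumes 2 MiB alignment and a reservation whose size
is fixed once at "boot".

```python
HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB


def align_up(nbytes, align):
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


class ReservedPool:
    """Toy model of an area reserved at boot and grown in huge-page steps."""

    def __init__(self, size):
        self.size = align_up(size, HUGE_PAGE)  # fixed at "boot time"
        self.used = 0

    def grow(self, nbytes):
        """Hand out the next huge-page-aligned increment covering nbytes.

        Returns (start_offset, length). Raises MemoryError once the
        boot-time reservation is exhausted.
        """
        step = align_up(nbytes, HUGE_PAGE)
        if self.used + step > self.size:
            raise MemoryError("boot-time reservation exhausted")
        start = self.used
        self.used += step
        return start, step
```

Every request, even a 4 KiB one, consumes at least one huge page of
the reservation, and a pool sized too small at boot fails at runtime
with no way to recover in this model.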

I think it's arguable that even if a VM converts 10% of its memory to
shared at 4k granularity, the remaining 90%, a significant portion,
still needs fewer page table walks when backed by 1G/2M pages.
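
A back-of-the-envelope version of that argument, as a Python sketch.
The 100 GiB VM size is an arbitrary example, and the model is the best
case: it assumes the shared 10% is clustered so that the remaining
memory stays fully huge-page mapped. It counts leaf mappings only, as
a rough proxy for page table walk cost.

```python
SZ_4K = 4 * 1024
SZ_2M = 2 * 1024 * 1024
SZ_1G = 1024 * 1024 * 1024


def leaf_mappings(total_gib, shared_gib, huge_bytes):
    """Return (shared_4k_entries, private_huge_entries).

    Best case only: the shared part is mapped with 4 KiB entries and
    everything else stays mapped at huge_bytes granularity.
    """
    shared = shared_gib * SZ_1G // SZ_4K
    private = (total_gib - shared_gib) * SZ_1G // huge_bytes
    return shared, private


# Example: 100 GiB VM with 10 GiB (10%) converted to shared.
with_1g = leaf_mappings(100, 10, SZ_1G)
with_2m = leaf_mappings(100, 10, SZ_2M)
all_4k = 100 * SZ_1G // SZ_4K
```

If conversions are scattered across many huge pages instead, those
huge mappings would have to be split and the benefit shrinks
accordingly.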

>
> A direction that might make sense is either (A) enlightening the VM about
> the granularity in which memory can be converted (but also problematic
> for 1 GiB pages) and/or (B) physically restricting the memory that can
> be converted.

Physically restricting the memory will still need a safe maximum bound
to be calculated based on all the shared memory usecases that a VM can
encounter.

>
> For example, one could create a GPA layout where some regions are backed
> by gigantic pages that cannot be converted/can only be converted as a
> whole, and some are backed by 4k pages that can be converted back and
> forth. We'd use multiple guest_memfds for that. I recall that physically
> restricting such conversions/locations (e.g., for bounce buffers) in
> Linux was already discussed somewhere, but I don't recall the details.
>
> It's all not trivial and not easy to get "clean".

Yeah, agree with this point, it's difficult to get a clean solution
here, but the host side solution might be easier to deploy (not
necessarily easier to implement) and possibly cleaner than attempts to
regulate the guest side.

>
> Concluding that individual pieces of a 1 GiB / 2 MiB huge page should
> not be converted back and forth might be reasonable. Although I'm sure
> people will argue the opposite and develop hackish solutions in
> desperate ways to make it work somehow.
>
> Huge pages, and especially gigantic pages, are simply a bad fit if the
> VM will convert individual 4k pages.
>
>
> But to answer your last question: we might be able to avoid GUP by using
> a different mapping API, similar to the ones KVM now provides.
>
> --
> Cheers,
>
> David / dhildenb
>

[1] https://cdrdv2.intel.com/v1/dl/getContent/773614
[2] https://lore.kernel.org/lkml/20240112055251.36101-2-vannapurve@xxxxxxxxxx/