On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@xxxxxxx> wrote:
> >
> > This patchset is also available at:
> >
> >   https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> >
> > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > a snapshot of his patches[1] to provide tracking of whether or not
> > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks
> > issued before guest access:
> >
> >   d55475f23cea KVM: gmem: track preparedness a page at a time
> >   64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> >   17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> >   e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> >
> > [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@xxxxxxxxxx/
> >
> > This series addresses some of the pending review comments for those
> > patches (feel free to squash/rework as-needed), and implements a first
> > real user in the form of a reworked version of Sean's original 2MB THP
> > support for gmem.
> >
>
> Looking at the work targeted by Fuad to add in-place memory conversion
> support via [1] and Ackerley in future to address hugetlb page
> support, can the state tracking for preparedness be simplified as?
>
> i) prepare guest memfd ranges when "first time an offset with
> mappability = GUEST is allocated or first time an allocated offset has
> mappability = GUEST". Some scenarios that would lead to guest memfd
> range preparation:
>     - Create file with default mappability to host, fallocate, convert
>     - Create file with default mappability to Guest, guest faults on
>       private memory

Yes, this seems like a compelling approach. One aspect that still remains
is knowing *when* the preparation has been done, so that we can avoid
re-preparing the next time a private page is accessed: either to re-fault
it into the guest (e.g. because it was originally mapped 2MB and then a
sub-page got converted to shared, so the still-private pages need to get
re-faulted in as 4K), or via some other path where KVM needs to grab the
private PFN through kvm_gmem_get_pfn() without actually reading/writing it
(I think the GHCB AP_CREATION path for bringing up APs might do this).

We could just keep re-checking the RMP table to see if the PFN was already
set to private, but I think one of the design goals of the preparedness
tracking was to have gmem itself be aware of this rather than farming it
out to platform-specific data structures/tracking.

So as a proof of concept I've been experimenting with using Fuad's series
([1] in your response) and adding an additional GUEST_PREPARED state so
that it can be tracked via the same mappability xarray (or whatever data
structure we end up using for mappability-tracking). In that case GUEST
becomes sort of a transient state that can be set in advance of actual
allocation/fault-time. That seems to have a lot of nice characteristics,
because (in that series at least) guest-mappable (as opposed to
all-mappable) specifically corresponds to private guest pages, which for
SNP require preparation before they can be mapped into the nested page
table, so it seems like a natural fit.
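To make that a bit more concrete, here's roughly the shape of what I've
been experimenting with. This is just a sketch: the state and helper names
below are placeholders of my own (not what Fuad's series actually defines),
and the lookup/storage details will depend on whatever mappability-tracking
structure we settle on.

  #include <linux/kvm_host.h>
  #include <linux/xarray.h>

  /*
   * Per-offset mappability states kept in the gmem xarray. Placeholder
   * names; GUEST_PREPARED is the extra state being proposed here.
   */
  enum gmem_mappability {
          GMEM_MAPPABLE_ALL,              /* host- and guest-mappable (shared) */
          GMEM_MAPPABLE_GUEST,            /* guest-only (private), not yet prepared */
          GMEM_MAPPABLE_GUEST_PREPARED,   /* guest-only, prepare hook already issued */
  };

  /*
   * Hypothetical helper for the kvm_gmem_get_pfn() path: issue
   * kvm_arch_gmem_prepare() only the first time a GUEST offset is faulted
   * in, then record that in the same xarray so later lookups (re-faulting
   * at 4K after a 2MB folio is partially converted, AP_CREATION-style PFN
   * lookups, etc.) can skip it without consulting the RMP table.
   */
  static int gmem_prepare_if_needed(struct kvm *kvm, struct xarray *mappability,
                                    pgoff_t index, gfn_t gfn, kvm_pfn_t pfn,
                                    int max_order)
  {
          void *entry = xa_load(mappability, index);
          int ret;

          /*
           * No entry is treated as "not private" here for simplicity; a
           * real implementation would need to factor in the file's default
           * mappability.
           */
          if (!entry || xa_to_value(entry) != GMEM_MAPPABLE_GUEST)
                  return 0;

          ret = kvm_arch_gmem_prepare(kvm, gfn, pfn, max_order);
          if (ret)
                  return ret;

          return xa_err(xa_store(mappability, index,
                                 xa_mk_value(GMEM_MAPPABLE_GUEST_PREPARED),
                                 GFP_KERNEL));
  }

The main point is that the prepared/not-prepared information lives in the
same structure as the mappability state, so the conversion/truncation paths
that already update that state have a natural place to decide when to
unprepare.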
> ii) Unprepare guest memfd ranges when "first time an offset with
> mappability = GUEST is deallocated or first time an allocated offset
> has lost mappability = GUEST attribute", some scenarios that would
> lead to guest memfd range unprepare:
>     - Truncation
>     - Conversion

Similar story here: it seems like a good fit. Truncation already does the
unprepare via the .free_folio->kvm_arch_gmem_invalidate() callback, and if
we rework THP to behave similarly to HugeTLB, in that we only free back the
full 2MB folio rather than splitting it like in this series, I think that
might be sufficient for truncation. If userspace tries to truncate a subset
of a 2MB private folio we could no-op and just leave it in GUEST_PREPARED.

If we stick with THP, my thinking is that we tell userspace what the max
granularity is, and userspace will know that it must truncate with that
same granularity if it actually wants to free memory. It sounds like the
HugeTLB support would similarly provide this sort of information. What's
nice is that if we stick with a best-effort THP-based allocator, and allow
it to fall back to smaller page sizes, this scheme would still work, since
we'd still always be able to free folios without splitting. But I'll try to
get a better idea of what this looks like in practice.

For conversion, we'd need to hook in an additional kvm_arch_gmem_invalidate()
somewhere to make sure the folio is host-owned in the RMP table before
transitioning to host/all-mappable (rough sketch at the end of this mail),
but that seems pretty straightforward.

> iii) To handle scenarios with hugepages, page splitting/merging in
> guest memfd can also signal change in page granularities.

Not yet clear to me if extra handling for prepare/unprepare is needed here,
but it does seem like an option if needed.

Thanks,

Mike

>
> [1] https://lore.kernel.org/kvm/20250117163001.2326672-1-tabba@xxxxxxxxxx/
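As for the conversion hook mentioned above, here's a rough sketch of what I
had in mind (again with placeholder names, and reusing the GMEM_MAPPABLE_*
states and xarray from the sketch earlier in this mail; none of these
helpers exist yet):

  #include <linux/kvm_host.h>
  #include <linux/mm.h>
  #include <linux/xarray.h>

  /*
   * Hypothetical hook on the private -> shared conversion path: before an
   * offset transitions to host/all-mappable, make sure any prepared folio
   * backing it is returned to host-owned state in the RMP table, using the
   * same arch hook that truncation already invokes via .free_folio.
   */
  static void gmem_unprepare_for_conversion(struct xarray *mappability,
                                            struct folio *folio, pgoff_t index)
  {
          void *entry = xa_load(mappability, index);

          if (!entry || xa_to_value(entry) != GMEM_MAPPABLE_GUEST_PREPARED)
                  return;

          kvm_arch_gmem_invalidate(folio_pfn(folio),
                                   folio_pfn(folio) + folio_nr_pages(folio));

          xa_store(mappability, index, xa_mk_value(GMEM_MAPPABLE_ALL),
                   GFP_KERNEL);
  }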