On Wed, Feb 19, 2025 at 07:09:57PM -0600, Michael Roth wrote:
> On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> > On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@xxxxxxx> wrote:
> > >
> > > This patchset is also available at:
> > >
> > >   https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> > >
> > > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > > a snapshot of his patches[1] to provide tracking of whether or not
> > > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> > > before guest access:
> > >
> > >   d55475f23cea KVM: gmem: track preparedness a page at a time
> > >   64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> > >   17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> > >   e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> > >
> > > [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@xxxxxxxxxx/
> > >
> > > This series addresses some of the pending review comments for those patches
> > > (feel free to squash/rework as-needed), and implements a first real user in
> > > the form of a reworked version of Sean's original 2MB THP support for gmem.
> > >
> >
> > Looking at the work targeted by Fuad to add in-place memory conversion
> > support via [1] and Ackerley in future to address hugetlb page
> > support, can the state tracking for preparedness be simplified as?
> > i) prepare guest memfd ranges when "first time an offset with
> > mappability = GUEST is allocated or first time an allocated offset has
> > mappability = GUEST". Some scenarios that would lead to guest memfd
> > range preparation:
> >   - Create file with default mappability to host, fallocate, convert
> >   - Create file with default mappability to Guest, guest faults on
> >     private memory
>
> Yes, this seems like a compelling approach.
> One aspect that still
> remains is knowing *when* the preparation has been done, so that the
> next time a private page is accessed, either to re-fault into the guest
> (e.g. because it was originally mapped 2MB and then a sub-page got
> converted to shared so the still-private pages need to get re-faulted
> in as 4K), or maybe some other path where KVM needs to grab the private
> PFN via kvm_gmem_get_pfn() but not actually read/write to it (I think
> the GHCB AP_CREATION path for bringing up APs might do this).
>
> We could just keep re-checking the RMP table to see if the PFN was
> already set to private in the RMP table, but I think one of the design
> goals of the preparedness tracking was to have gmem itself be aware of
> this and not farm it out to platform-specific data structures/tracking.
>
> So as a proof of concept I've been experimenting with using Fuad's
> series ([1] in your response) and adding an additional GUEST_PREPARED
> state so that it can be tracked via the same mappability xarray (or
> whatever data structure we end up using for mappability-tracking).
> In that case GUEST becomes sort of a transient state that can be set
> in advance of actual allocation/fault-time.

Hi Michael,

We are currently working on enabling 2M huge pages on TDX. We noticed
this series and hope it could also work with TDX huge pages.

While disallowing <2M page conversion is not ideal for TDX either, we
think it would be great to start with 2M and non-in-place conversion
first. In that case, is the memory fragmentation caused by partial
discarding a problem for you [1]? Is page promotion a must in your
initial huge page support?

Do you have any repo containing your latest POC?

Thanks
Yan

[1] https://lore.kernel.org/all/Z9PyLE%2FLCrSr2jCM@xxxxxxxxxxxxxxxxxxxxxxxxx/
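For discussion purposes, here is a rough userspace sketch of the state
machine Michael describes above (GUEST as a transient state set at
conversion time, GUEST_PREPARED recorded once the prepare hook has run,
so a later re-fault can skip it). All names and transitions here are
hypothetical stand-ins, not the actual kernel code; the real tracking
would live in the gmem mappability xarray:

```c
#include <assert.h>

/* Hypothetical per-offset states, mirroring the mappability tracking
 * discussed in the thread plus the proposed GUEST_PREPARED state. */
enum gmem_state {
	GMEM_HOST,		/* mappable by host (shared) */
	GMEM_GUEST,		/* private; prepare hook not yet issued */
	GMEM_GUEST_PREPARED,	/* private; prepare hook already issued */
};

/* Conversion to private: GUEST is transient and can be set ahead of
 * actual allocation/fault time. An already-prepared offset stays
 * prepared. */
static enum gmem_state gmem_convert_to_guest(enum gmem_state s)
{
	return (s == GMEM_GUEST_PREPARED) ? GMEM_GUEST_PREPARED : GMEM_GUEST;
}

/* Fault/allocation of a private offset: issue the prepare hook only
 * on the first access, then remember that in gmem itself so a re-fault
 * (e.g. 2M mapping re-faulted as 4K after a sub-page conversion) does
 * not need to consult platform-specific structures like the RMP table. */
static enum gmem_state gmem_prepare_if_needed(enum gmem_state s,
					      int *prepare_calls)
{
	if (s == GMEM_GUEST) {
		(*prepare_calls)++;	/* stand-in for kvm_arch_gmem_prepare() */
		return GMEM_GUEST_PREPARED;
	}
	return s;
}
```

The point of the sketch is only that the second fault on an already
prepared offset becomes a no-op with respect to the prepare hook, which
is the property the GUEST_PREPARED state is meant to provide.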