On Mon, Feb 10, 2025 at 05:16:33PM -0800, Vishal Annapurve wrote:
> On Wed, Dec 11, 2024 at 10:37 PM Michael Roth <michael.roth@xxxxxxx> wrote:
> >
> > This patchset is also available at:
> >
> >   https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> >
> > and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> > a snapshot of his patches[1] to provide tracking of whether or not
> > sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks
> > issued before guest access:
> >
> >   d55475f23cea KVM: gmem: track preparedness a page at a time
> >   64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
> >   17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
> >   e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> >
> > [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@xxxxxxxxxx/
> >
> > This series addresses some of the pending review comments for those
> > patches (feel free to squash/rework as-needed), and implements a first
> > real user in the form of a reworked version of Sean's original 2MB THP
> > support for gmem.
> >
>
> Looking at the work targeted by Fuad to add in-place memory conversion
> support via [1] and Ackerley in future to address hugetlb page
> support, can the state tracking for preparedness be simplified as?
>
> i) prepare guest memfd ranges when "first time an offset with
> mappability = GUEST is allocated or first time an allocated offset has
> mappability = GUEST". Some scenarios that would lead to guest memfd
> range preparation:
>     - Create file with default mappability to host, fallocate, convert
>     - Create file with default mappability to Guest, guest faults on
>       private memory

Yes, this seems like a compelling approach. One aspect that still remains
is knowing *when* the preparation has been done, so that we can avoid
re-preparing the next time a private page is accessed: either to re-fault
it into the guest (e.g. because it was originally mapped 2MB and then a
sub-page got converted to shared, so the still-private pages need to get
re-faulted in as 4K), or via some other path where KVM needs to grab the
private PFN through kvm_gmem_get_pfn() without actually reading/writing it
(I think the GHCB AP_CREATION path for bringing up APs might do this).

We could just keep re-checking the RMP table to see if the PFN was already
set to private, but I think one of the design goals of the preparedness
tracking was to have gmem itself be aware of this rather than farming it
out to platform-specific data structures/tracking.

So as a proof of concept I've been experimenting with using Fuad's series
([1] in your response) and adding an additional GUEST_PREPARED state so
that it can be tracked via the same mappability xarray (or whatever data
structure we end up using for mappability-tracking). In that case GUEST
becomes sort of a transient state that can be set in advance of actual
allocation/fault-time. That seems to have a lot of nice characteristics,
because (in that series at least) guest-mappable (as opposed to
all-mappable) specifically corresponds to private guest pages, which for
SNP require preparation before they can be mapped into the nested page
table, so it seems like a natural fit.
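To make that a bit more concrete, here's roughly the shape of what I've
been experimenting with. This is just a sketch: the state and helper names
below are placeholders of my own (not what Fuad's series actually defines),
and the lookup/storage details will depend on whatever mappability-tracking
structure we settle on.

  #include <linux/kvm_host.h>
  #include <linux/xarray.h>

  /*
   * Per-offset mappability states kept in the gmem xarray. Placeholder
   * names; GUEST_PREPARED is the extra state being proposed here.
   */
  enum gmem_mappability {
          GMEM_MAPPABLE_ALL,              /* host- and guest-mappable (shared) */
          GMEM_MAPPABLE_GUEST,            /* guest-only (private), not yet prepared */
          GMEM_MAPPABLE_GUEST_PREPARED,   /* guest-only, prepare hook already issued */
  };

  /*
   * Hypothetical helper for the kvm_gmem_get_pfn() path: issue
   * kvm_arch_gmem_prepare() only the first time a GUEST offset is faulted
   * in, then record that in the same xarray so later lookups (re-faulting
   * at 4K after a 2MB folio is partially converted, AP_CREATION-style PFN
   * lookups, etc.) can skip it without consulting the RMP table.
   */
  static int gmem_prepare_if_needed(struct kvm *kvm, struct xarray *mappability,
                                    pgoff_t index, gfn_t gfn, kvm_pfn_t pfn,
                                    int max_order)
  {
          void *entry = xa_load(mappability, index);
          int ret;

          /*
           * No entry is treated as "not private" here for simplicity; a
           * real implementation would need to factor in the file's default
           * mappability.
           */
          if (!entry || xa_to_value(entry) != GMEM_MAPPABLE_GUEST)
                  return 0;

          ret = kvm_arch_gmem_prepare(kvm, gfn, pfn, max_order);
          if (ret)
                  return ret;

          return xa_err(xa_store(mappability, index,
                                 xa_mk_value(GMEM_MAPPABLE_GUEST_PREPARED),
                                 GFP_KERNEL));
  }

The main point is that the prepared/not-prepared information lives in the
same structure as the mappability state, so the conversion/truncation paths
that already update that state have a natural place to decide when to
unprepare.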
> ii) Unprepare guest memfd ranges when "first time an offset with
> mappability = GUEST is deallocated or first time an allocated offset
> has lost mappability = GUEST attribute", some scenarios that would
> lead to guest memfd range unprepare:
>     - Truncation
>     - Conversion

Similar story here: it seems like a good fit. Truncation already does the
unprepare via the .free_folio->kvm_arch_gmem_invalidate() callback, and if
we rework THP to behave similarly to HugeTLB, in that we only free back the
full 2MB folio rather than splitting it like in this series, I think that
might be sufficient for truncation. If userspace tries to truncate a subset
of a 2MB private folio we could no-op and just leave it in GUEST_PREPARED.

If we stick with THP, my thinking is that we tell userspace what the max
granularity is, and userspace will know that it must truncate with that
same granularity if it actually wants to free memory. It sounds like the
HugeTLB support would similarly provide this sort of information. What's
nice is that if we stick with a best-effort THP-based allocator, and allow
it to fall back to smaller page sizes, this scheme would still work, since
we'd still always be able to free folios without splitting. But I'll try to
get a better idea of what this looks like in practice.

For conversion, we'd need to hook in an additional kvm_arch_gmem_invalidate()
somewhere to make sure the folio is host-owned in the RMP table before
transitioning to host/all-mappable (rough sketch at the end of this mail),
but that seems pretty straightforward.

> iii) To handle scenarios with hugepages, page splitting/merging in
> guest memfd can also signal change in page granularities.

Not yet clear to me if extra handling for prepare/unprepare is needed here,
but it does seem like an option if needed.

Thanks,

Mike

>
> [1] https://lore.kernel.org/kvm/20250117163001.2326672-1-tabba@xxxxxxxxxx/
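As for the conversion hook mentioned above, here's a rough sketch of what I
had in mind (again with placeholder names, and reusing the GMEM_MAPPABLE_*
states and xarray from the sketch earlier in this mail; none of these
helpers exist yet):

  #include <linux/kvm_host.h>
  #include <linux/mm.h>
  #include <linux/xarray.h>

  /*
   * Hypothetical hook on the private -> shared conversion path: before an
   * offset transitions to host/all-mappable, make sure any prepared folio
   * backing it is returned to host-owned state in the RMP table, using the
   * same arch hook that truncation already invokes via .free_folio.
   */
  static void gmem_unprepare_for_conversion(struct xarray *mappability,
                                            struct folio *folio, pgoff_t index)
  {
          void *entry = xa_load(mappability, index);

          if (!entry || xa_to_value(entry) != GMEM_MAPPABLE_GUEST_PREPARED)
                  return;

          kvm_arch_gmem_invalidate(folio_pfn(folio),
                                   folio_pfn(folio) + folio_nr_pages(folio));

          xa_store(mappability, index, xa_mk_value(GMEM_MAPPABLE_ALL),
                   GFP_KERNEL);
  }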