On Tue, Mar 18, 2025 at 08:13:05PM +0100, David Hildenbrand wrote:
> On 18.03.25 03:24, Yan Zhao wrote:
> > On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:
> > > On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
> > > > On 14.03.25 10:09, Yan Zhao wrote:
> > > > > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> > > > > > (split is possible if there are no unexpected folio references; private pages cannot be GUP'ed, so it is feasible)
> > > > > ...
> > > > > > > > Note that I'm not quite sure about the "2MB" interface, should it be a "PMD-size" interface?
> > > > > > > I think Mike and I touched upon this aspect too - and I may be misremembering - Mike suggested getting 1M, 2M, and bigger page sizes in increments -- and then fitting in PMD sizes when we've had enough of those. That is to say he didn't want to preclude it, or gate the PMD work on enabling all sizes first.
> > > > > > Starting with 2M is reasonable for now. The real question is how we want to deal with
> > > > > Hi David,
> > > > Hi!
> > > > > I'm just trying to understand the background of in-place conversion.
> > > > >
> > > > > Regarding the two issues you mentioned with THP and non-in-place conversion, I have some questions (still based on starting with 2M):
> > > > >
> > > > > > (a) Not being able to allocate a 2M folio reliably
> > > > > If we start with faulting in private pages from guest_memfd (not in a page-pool way) and shared pages anonymously, is it correct to say that this is only a concern when memory is under pressure?
> > > > Usually, fragmentation starts being a problem under memory pressure, and memory pressure can show up simply because the page cache makes use of as much memory as it wants.
> > > >
> > > > As soon as we start allocating a 2 MB page for guest_memfd, to then split it up + free only some parts back to the buddy (on private->shared conversion), we create fragmentation that cannot get resolved as long as the remaining private pages are not freed. A new conversion from shared->private on the previously freed parts will allocate other unmovable pages (not the freed ones) and make fragmentation worse.
> > > Ah, I see. The fragmentation problem arises because memory allocated by guest_memfd is unmovable, so after freeing part of a 2MB folio, the whole 2MB is still unmovable.
> > >
> > > I previously thought fragmentation would only impact the guest by providing no new huge pages. So if a confidential VM does not support merging small PTEs into a huge PMD entry in its private page table, even if the new huge memory range is physically contiguous after a private->shared->private conversion, the guest still cannot bring back huge pages.
> > > > In-place conversion improves that quite a lot, because guest_memfd itself will not cause unmovable fragmentation. Of course, under memory pressure, when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But then, we already had fragmentation (and did not really cause any new one).
> > > >
> > > > We discussed in the upstream call that if guest_memfd (primarily) only allocates 2M pages and frees 2M pages, it will not cause fragmentation itself, which is pretty nice.
> > > Makes sense.
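
(To double check that I follow the sequence that creates this fragmentation, here is a rough userspace sketch of a single 4K private->shared conversion without in-place conversion. It only uses the existing guest_memfd/KVM uAPI as I understand it; memslot setup and error handling are omitted, and the comments describe my reading of the kernel-side effect, not code from this series.)

#define _GNU_SOURCE
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

#define SZ_4K 0x1000ULL

/* Convert one 4K page at @gpa (backed at @gmem_offset in the guest_memfd)
 * from private to shared, the non-in-place way. */
static int convert_4k_to_shared(int vm_fd, int gmem_fd,
				uint64_t gpa, uint64_t gmem_offset)
{
	struct kvm_memory_attributes attr = {
		.address = gpa,
		.size = SZ_4K,
		.attributes = 0,	/* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	/* 1) Shared faults on this 4K range now go to the separate
	 *    (e.g. anonymous) mapping instead of guest_memfd. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
		return -1;

	/* 2) Discard the now-unused private page. If guest_memfd backed the
	 *    range with a 2M folio, this forces a split and frees exactly one
	 *    4K page back to the buddy; the other 511 pages stay allocated and
	 *    unmovable, which is the fragmentation that cannot be resolved
	 *    until they are freed as well. */
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 gmem_offset, SZ_4K);
}
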
> > > > > > (b) Partial discarding
> > > > > For shared pages, are page migration and folio split possible for shared THP?
> > > > I assume by "shared" you mean "not guest_memfd, but some other memory we use as an overlay" -- so no in-place conversion.
> > > Yes, not guest_memfd, in the case of non-in-place conversion.
> > > > Yes, that should be possible as long as nothing else prevents migration/split (e.g., longterm pinning).
> > > > > For private pages, as you pointed out earlier, if we can ensure there are no unexpected folio references for private memory, splitting a private huge folio should succeed.
> > > > Yes, and maybe (hopefully) we'll reach a point where private parts will not have a refcount at all (initially, frozen refcount, discussed during the last upstream call).
> > > Yes, I also tested in TDX by not acquiring the folio refcount in TDX-specific code and found that partial splitting could work.
> > > > > Are you concerned about the memory fragmentation after repeated partial conversions of private pages to and from shared?
> > > > Not only repeated, even just a single partial conversion. But of course, repeated partial conversions will make it worse (e.g., never getting a private huge page back when there was a partial conversion).
> > > Thanks for the explanation!
> > >
> > > Do you think there's any chance for guest_memfd to support non-in-place conversion first?
> > e.g., we can have private pages allocated from guest_memfd and allow the private pages to be THP.
> >
> > Meanwhile, shared pages are not allocated from guest_memfd, and they are only faulted in at 4K granularity (specify it by a flag?).
> >
> > When we want to convert a 4K page from a 2M private folio to shared, we can just split the 2M private folio, as there's no extra refcount on private pages;
> Yes, IIRC that's precisely what this series is doing, because the ftruncate() will try splitting the folio (which might still fail on speculative references, see my comment in reply to this series).
>
> In essence: yes, splitting to 4k should work (although speculative references might require us to retry). But the "4k hole punch" is the ugly bit.
>
> So you really want in-place conversion where the private->shared will split (but not punch) and the shared->private will collapse again if possible.
> > When we do shared-to-private conversion, no split is required as shared pages are in 4K granularity. And even if the user fails to specify the shared pages as small pages only, the worst thing is that a 2M shared folio cannot be split, and more memory is consumed.
> >
> > Of course, memory fragmentation is still an issue as the private pages are allocated unmovable.
> Yes, and you will never ever get a "THP" back when there was a conversion from private->shared of a single page that split the THP and discarded that page.

Yes, unless we still keep that page in page cache, which would consume even more memory.
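
(On the speculative-reference point: my mental model of the retry on the split path is something like the kernel-side sketch below. This is not code from this series; the helper name is made up and the exact error codes returned by split_folio() are an assumption. It is only meant to illustrate "splitting should work, but may need a retry".)

#include <linux/huge_mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/* Hypothetical helper, not from this series: split a private huge folio on
 * private->shared conversion, retrying a few times because a transient
 * speculative reference can make the split fail even though private pages
 * cannot be GUP'ed. */
static int gmem_split_private_folio(struct folio *folio)
{
	int retries = 10;	/* arbitrary bound, just for the sketch */
	int ret;

	do {
		folio_lock(folio);
		/* Fails (e.g. -EAGAIN/-EBUSY, exact code assumed here) when an
		 * unexpected extra reference prevents freezing the refcount. */
		ret = split_folio(folio);
		folio_unlock(folio);
		if (!ret)
			return 0;
		cond_resched();
	} while (--retries);

	return ret;
}

(If private parts eventually have no refcount at all, as mentioned above with the frozen refcount, I'd expect the retry to become unnecessary.)
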
> > But do you think it's a good simpler start before in-place conversion is ready?
> There was a discussion on that at the bi-weekly upstream meeting on February the 6th. The recording has more details; I summarized it as:
> > "David: Probably a good idea to focus on the long-term use case where we have in-place conversion support, and only allow truncation in hugepage (e.g., 2 MiB) size; conversion shared<->private could still be done on 4 KiB granularity as for hugetlb."

Will check and study it. Thanks for directing me to the history.

> In general, I think our time is better spent working on the real deal than on interim solutions that should not be called "THP support".

I see. Thanks for the explanation!
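
(If I read the summary correctly, the rule boils down to something like the sketch below. The helper names and size constants are mine, nothing here is existing uAPI or code from this series; it is only to check my understanding of "truncate at 2 MiB, convert at 4 KiB".)

#include <stdbool.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL

/* Hypothetical check: truncation (hole punching) would only be allowed in
 * whole huge-page (2 MiB) units... */
static bool gmem_truncate_allowed(uint64_t offset, uint64_t len)
{
	return len && !(offset & (SZ_2M - 1)) && !(len & (SZ_2M - 1));
}

/* ...while shared<->private conversion could stay 4 KiB granular. */
static bool gmem_convert_allowed(uint64_t gpa, uint64_t len)
{
	return len && !(gpa & (SZ_4K - 1)) && !(len & (SZ_4K - 1));
}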