On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
> On 14.03.25 10:09, Yan Zhao wrote:
> > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> > > (split is possible if there are no unexpected folio references; private
> > > pages cannot be GUP'ed, so it is feasible)
> > ...
> > > > > Note that I'm not quite sure about the "2MB" interface, should it be
> > > > > a "PMD-size" interface?
> > > >
> > > > I think Mike and I touched upon this aspect too - and I may be
> > > > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
> > > > in increments -- and then fitting in PMD sizes when we've had enough of
> > > > those. That is to say he didn't want to preclude it, or gate the PMD
> > > > work on enabling all sizes first.
> > >
> > > Starting with 2M is reasonable for now. The real question is how we want to
> > > deal with
> > Hi David,
> >
>
> Hi!
>
> > I'm just trying to understand the background of in-place conversion.
> >
> > Regarding the two issues you mentioned with THP and non-in-place-conversion,
> > I have some questions (still based on starting with 2M):
> >
> > > (a) Not being able to allocate a 2M folio reliably
> > If we start with fault in private pages from guest_memfd (not in page pool way)
> > and shared pages anonymously, is it correct to say that this is only a concern
> > when memory is under pressure?
>
> Usually, fragmentation starts being a problem under memory pressure, and
> memory pressure can show up simply because the page cache makes use of as
> much memory as it wants.
>
> As soon as we start allocating a 2 MB page for guest_memfd, to then split it
> up + free only some parts back to the buddy (on private->shared conversion),
> we create fragmentation that cannot get resolved as long as the remaining
> private pages are not freed.
> A new conversion from shared->private on the
> previously freed parts will allocate other unmovable pages (not the freed
> ones) and make fragmentation worse.

Ah, I see. The problem of fragmentation is because memory allocated by
guest_memfd is unmovable. So after freeing part of a 2MB folio, the whole 2MB
is still unmovable.

I previously thought fragmentation would only impact the guest by providing no
new huge pages. So if a confidential VM does not support merging small PTEs
into a huge PMD entry in its private page table, even if the new huge memory
range is physically contiguous after a private->shared->private conversion,
the guest still cannot bring back huge pages.

> In-place conversion improves that quite a lot, because guest_memfd itself
> will not cause unmovable fragmentation. Of course, under memory pressure,
> when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
> then, we already had fragmentation (and did not really cause any new one).
>
> We discussed in the upstream call, that if guest_memfd (primarily) only
> allocates 2M pages and frees 2M pages, it will not cause fragmentation
> itself, which is pretty nice.

Makes sense.

> >
> > > (b) Partial discarding
> > For shared pages, page migration and folio split are possible for shared THP?
>
> I assume by "shared" you mean "not guest_memfd, but some other memory we use

Yes, not guest_memfd, in the case of non-in-place conversion.

> as an overlay" -- so no in-place conversion.
>
> Yes, that should be possible as long as nothing else prevents
> migration/split (e.g., longterm pinning)
>
> >
> > For private pages, as you pointed out earlier, if we can ensure there are no
> > unexpected folio references for private memory, splitting a private huge folio
> > should succeed.
>
> Yes, and maybe (hopefully) we'll reach a point where private parts will not
> have a refcount at all (initially, frozen refcount, discussed during the
> last upstream call).
Yes, I also tested in TDX by not acquiring folio ref count in TDX specific
code and found that partial splitting could work.

> > Are you concerned about the memory fragmentation after repeated
> > partial conversions of private pages to and from shared?
>
> Not only repeated, even just a single partial conversion. But of course,
> repeated partial conversions will make it worse (e.g., never getting a
> private huge page back when there was a partial conversion).

Thanks for the explanation!

Do you think there's any chance for guest_memfd to support non-in-place
conversion first?
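
To make the fragmentation argument above concrete, here is a toy Python model
of one private->shared->private round trip. It is only a sketch and not kernel
code. It does not model the buddy allocator, migratetypes, or compaction. The
names (simulate, partially_pinned) are made up for illustration, the sizes
assume 4K pages in 2M blocks, and the non-in-place path assumes that the new
private pages land in a different 2M block than the pages just freed, as
described in the quoted explanation.

```python
# Toy model only: NOT kernel code. All names here are hypothetical.
PAGES_PER_BLOCK = 512  # 4 KiB pages per 2 MiB block (assumes x86-64 sizes)
FREE, UNMOVABLE = "free", "unmovable"


def partially_pinned(blocks):
    """Count 2M blocks that hold some unmovable pages but are not fully
    unmovable. Their free space cannot form a 2M page again until the
    remaining unmovable pages are freed."""
    return sum(1 for blk in blocks if UNMOVABLE in blk and FREE in blk)


def simulate(in_place, nr_blocks=4, converted=16):
    """Convert `converted` 4K pages of one 2M folio private->shared->private.

    Returns the number of partially pinned 2M blocks afterwards.
    """
    blocks = [[FREE] * PAGES_PER_BLOCK for _ in range(nr_blocks)]
    blocks[0] = [UNMOVABLE] * PAGES_PER_BLOCK  # guest_memfd takes one 2M folio

    if in_place:
        # In-place conversion: the pages stay in guest_memfd and only change
        # their private/shared state, so nothing goes back to the allocator.
        return partially_pinned(blocks)

    # private->shared, non-in-place: split the folio, free some 4K pages.
    for p in range(converted):
        blocks[0][p] = FREE
    # shared->private: new unmovable 4K pages. Assumption: the allocator
    # hands out pages from another block, not the ones just freed.
    for p in range(converted):
        blocks[1][p] = UNMOVABLE
    return partially_pinned(blocks)


print(simulate(in_place=True))   # 0
print(simulate(in_place=False))  # 2
```

In this model, a single partial conversion already leaves one 2M block
pinned, and converting back pins a second one, while the in-place path leaves
none.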