On Tue, Mar 18, 2025 at 08:13:05PM +0100, David Hildenbrand wrote:
> On 18.03.25 03:24, Yan Zhao wrote:
> > On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:
> > > On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
> > > > On 14.03.25 10:09, Yan Zhao wrote:
> > > > > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> > > > > > (split is possible if there are no unexpected folio references; private pages cannot be GUP'ed, so it is feasible)
> > > > > ...
> > > > > > > > Note that I'm not quite sure about the "2MB" interface, should it be a "PMD-size" interface?
> > > > > > > I think Mike and I touched upon this aspect too - and I may be misremembering - Mike suggested getting 1M, 2M, and bigger page sizes in increments -- and then fitting in PMD sizes when we've had enough of those. That is to say he didn't want to preclude it, or gate the PMD work on enabling all sizes first.
> > > > > > Starting with 2M is reasonable for now. The real question is how we want to deal with
> > > > > Hi David,
> > > > Hi!
> > > > > I'm just trying to understand the background of in-place conversion.
> > > > >
> > > > > Regarding the two issues you mentioned with THP and non-in-place conversion, I have some questions (still based on starting with 2M):
> > > > >
> > > > > > (a) Not being able to allocate a 2M folio reliably
> > > > > If we start with faulting in private pages from guest_memfd (not in a page-pool way) and shared pages anonymously, is it correct to say that this is only a concern when memory is under pressure?
> > > > Usually, fragmentation starts being a problem under memory pressure, and memory pressure can show up simply because the page cache makes use of as much memory as it wants.
> > > >
> > > > As soon as we start allocating a 2 MB page for guest_memfd, to then split it up + free only some parts back to the buddy (on private->shared conversion), we create fragmentation that cannot get resolved as long as the remaining private pages are not freed. A new conversion from shared->private on the previously freed parts will allocate other unmovable pages (not the freed ones) and make fragmentation worse.
> > > Ah, I see. The fragmentation problem arises because memory allocated by guest_memfd is unmovable, so after freeing part of a 2MB folio, the whole 2MB is still unmovable.
> > >
> > > I previously thought fragmentation would only impact the guest by providing no new huge pages. So if a confidential VM does not support merging small PTEs into a huge PMD entry in its private page table, even if the new huge memory range is physically contiguous after a private->shared->private conversion, the guest still cannot bring back huge pages.
> > > > In-place conversion improves that quite a lot, because guest_memfd itself will not cause unmovable fragmentation. Of course, under memory pressure, when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But then, we already had fragmentation (and did not really cause any new one).
> > > >
> > > > We discussed in the upstream call that if guest_memfd (primarily) only allocates 2M pages and frees 2M pages, it will not cause fragmentation itself, which is pretty nice.
> > > Makes sense.
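
(To double check that I follow the sequence that creates this fragmentation, here is a rough userspace sketch of a single 4K private->shared conversion without in-place conversion. It only uses the existing guest_memfd/KVM uAPI as I understand it; memslot setup and error handling are omitted, and the comments describe my reading of the kernel-side effect, not code from this series.)

#define _GNU_SOURCE
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/falloc.h>
#include <linux/kvm.h>

#define SZ_4K 0x1000ULL

/* Convert one 4K page at @gpa (backed at @gmem_offset in the guest_memfd)
 * from private to shared, the non-in-place way. */
static int convert_4k_to_shared(int vm_fd, int gmem_fd,
				uint64_t gpa, uint64_t gmem_offset)
{
	struct kvm_memory_attributes attr = {
		.address = gpa,
		.size = SZ_4K,
		.attributes = 0,	/* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
	};

	/* 1) Shared faults on this 4K range now go to the separate
	 *    (e.g. anonymous) mapping instead of guest_memfd. */
	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
		return -1;

	/* 2) Discard the now-unused private page. If guest_memfd backed the
	 *    range with a 2M folio, this forces a split and frees exactly one
	 *    4K page back to the buddy; the other 511 pages stay allocated and
	 *    unmovable, which is the fragmentation that cannot be resolved
	 *    until they are freed as well. */
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 gmem_offset, SZ_4K);
}
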
> > > > > > (b) Partial discarding
> > > > > For shared pages, are page migration and folio split possible for shared THP?
> > > > I assume by "shared" you mean "not guest_memfd, but some other memory we use as an overlay" -- so no in-place conversion.
> > > Yes, not guest_memfd, in the case of non-in-place conversion.
> > > > Yes, that should be possible as long as nothing else prevents migration/split (e.g., longterm pinning).
> > > > > For private pages, as you pointed out earlier, if we can ensure there are no unexpected folio references for private memory, splitting a private huge folio should succeed.
> > > > Yes, and maybe (hopefully) we'll reach a point where private parts will not have a refcount at all (initially, frozen refcount, discussed during the last upstream call).
> > > Yes, I also tested in TDX by not acquiring the folio refcount in TDX-specific code and found that partial splitting could work.
> > > > > Are you concerned about the memory fragmentation after repeated partial conversions of private pages to and from shared?
> > > > Not only repeated, even just a single partial conversion. But of course, repeated partial conversions will make it worse (e.g., never getting a private huge page back when there was a partial conversion).
> > > Thanks for the explanation!
> > >
> > > Do you think there's any chance for guest_memfd to support non-in-place conversion first?
> > e.g., we can have private pages allocated from guest_memfd and allow the private pages to be THP.
> >
> > Meanwhile, shared pages are not allocated from guest_memfd, and they are only faulted in at 4K granularity (specify it by a flag?).
> >
> > When we want to convert a 4K page from a 2M private folio to shared, we can just split the 2M private folio, as there's no extra refcount on private pages;
> Yes, IIRC that's precisely what this series is doing, because the ftruncate() will try splitting the folio (which might still fail on speculative references, see my comment in reply to this series).
>
> In essence: yes, splitting to 4k should work (although speculative references might require us to retry). But the "4k hole punch" is the ugly bit.
>
> So you really want in-place conversion where the private->shared will split (but not punch) and the shared->private will collapse again if possible.
> > When we do shared-to-private conversion, no split is required as shared pages are in 4K granularity. And even if the user fails to specify the shared pages as small pages only, the worst thing is that a 2M shared folio cannot be split, and more memory is consumed.
> >
> > Of course, memory fragmentation is still an issue as the private pages are allocated unmovable.
> Yes, and you will never ever get a "THP" back when there was a conversion from private->shared of a single page that split the THP and discarded that page.

Yes, unless we still keep that page in page cache, which would consume even more memory.
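
(On the speculative-reference point: my mental model of the retry on the split path is something like the kernel-side sketch below. This is not code from this series; the helper name is made up and the exact error codes returned by split_folio() are an assumption. It is only meant to illustrate "splitting should work, but may need a retry".)

#include <linux/huge_mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/* Hypothetical helper, not from this series: split a private huge folio on
 * private->shared conversion, retrying a few times because a transient
 * speculative reference can make the split fail even though private pages
 * cannot be GUP'ed. */
static int gmem_split_private_folio(struct folio *folio)
{
	int retries = 10;	/* arbitrary bound, just for the sketch */
	int ret;

	do {
		folio_lock(folio);
		/* Fails (e.g. -EAGAIN/-EBUSY, exact code assumed here) when an
		 * unexpected extra reference prevents freezing the refcount. */
		ret = split_folio(folio);
		folio_unlock(folio);
		if (!ret)
			return 0;
		cond_resched();
	} while (--retries);

	return ret;
}

(If private parts eventually have no refcount at all, as mentioned above with the frozen refcount, I'd expect the retry to become unnecessary.)
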
> > But do you think it's a good simpler start before in-place conversion is ready?
> There was a discussion on that at the bi-weekly upstream meeting on February the 6th. The recording has more details; I summarized it as:
> > "David: Probably a good idea to focus on the long-term use case where we have in-place conversion support, and only allow truncation in hugepage (e.g., 2 MiB) size; conversion shared<->private could still be done on 4 KiB granularity as for hugetlb."

Will check and study it. Thanks for directing me to the history.

> In general, I think our time is better spent working on the real deal than on interim solutions that should not be called "THP support".

I see. Thanks for the explanation!
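
(If I read the summary correctly, the rule boils down to something like the sketch below. The helper names and size constants are mine, nothing here is existing uAPI or code from this series; it is only to check my understanding of "truncate at 2 MiB, convert at 4 KiB".)

#include <stdbool.h>
#include <stdint.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL

/* Hypothetical check: truncation (hole punching) would only be allowed in
 * whole huge-page (2 MiB) units... */
static bool gmem_truncate_allowed(uint64_t offset, uint64_t len)
{
	return len && !(offset & (SZ_2M - 1)) && !(len & (SZ_2M - 1));
}

/* ...while shared<->private conversion could stay 4 KiB granular. */
static bool gmem_convert_allowed(uint64_t gpa, uint64_t len)
{
	return len && !(gpa & (SZ_4K - 1)) && !(len & (SZ_4K - 1));
}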