Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private

On 16.10.24 12:48, Vishal Annapurve wrote:
On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

I also don't know how you treat things like folio_test_hugetlb() on
possible assumptions that the VMA must be a hugetlb VMA. I'll confess I
haven't checked the rest of the patchset yet - reading a large series
without a git tree is sometimes challenging to me.


I'm thinking of basically never involving folio_test_hugetlb(), and the
VMAs used by guest_memfd will also never be HugeTLB VMAs. That's
because only the HugeTLB allocator is used, but by the time the folio is
mapped to userspace, it will already have been split. After the
page is split, the folio loses its HugeTLB status. guest_memfd folios
will never be mapped to userspace while they still have HugeTLB
status.
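To make that invariant concrete, here is a rough sketch of what a
guest_memfd userspace fault path could check; kvm_gmem_get_faultable_folio()
is a made-up helper name, not something from the series:

static vm_fault_t kvm_gmem_fault_sketch(struct vm_fault *vmf)
{
	struct folio *folio;

	/* Hypothetical lookup of the folio backing this faultable offset. */
	folio = kvm_gmem_get_faultable_folio(vmf->vma->vm_file, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;

	/*
	 * The memory came from the HugeTLB allocator, but by now the huge
	 * folio must have been split and stripped of its HugeTLB status;
	 * folio_test_hugetlb() must never be true for a mappable folio.
	 */
	if (WARN_ON_ONCE(folio_test_hugetlb(folio))) {
		folio_put(folio);
		return VM_FAULT_SIGBUS;
	}

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return 0;
}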

We absolutely must convert these hugetlb folios to non-hugetlb folios.

That is one of the reasons why I raised at LPC that we should focus on
leaving hugetlb out of the picture and rather have a global pool, with
the option to move folios back and forth between the global pool and
hugetlb or guest_memfd.

How exactly that would look is TBD.

For the time being, I think we could add a "hack" to take hugetlb folios
from hugetlb for our purposes, but we would absolutely have to convert
them to non-hugetlb folios, especially once we split them into small
folios and start using the mapcount. But it doesn't feel quite clean.

As hugepage folios need to be split up in order to support backing
CoCo VMs with hugepages, I would assume any folio-based hugepage
memory allocation will need to go through split/merge cycles throughout
the guest_memfd lifetime.
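As a rough illustration of those cycles (the helpers below are
hypothetical; the real interfaces are what the next RFC is meant to
define): a huge folio backing private guest memory gets split into
small, non-HugeTLB folios when part of the range is converted to shared
and becomes faultable, and is only merged back once the range is private
again and nothing in it is mapped or pinned:

/* Shared conversion: split before userspace can fault the range. */
static int kvm_gmem_convert_to_shared_sketch(struct inode *inode,
					     pgoff_t start, pgoff_t end)
{
	/* kvm_gmem_split_folios() is hypothetical. */
	return kvm_gmem_split_folios(inode, start, end);
}

/* Private conversion: merge back once the range is no longer in use. */
static int kvm_gmem_convert_to_private_sketch(struct inode *inode,
					      pgoff_t start, pgoff_t end)
{
	/* Hypothetical: refuse to merge while any piece is still mapped. */
	if (kvm_gmem_range_has_mappings(inode, start, end))
		return -EBUSY;
	return kvm_gmem_merge_folios(inode, start, end);
}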

Yes, that's my understanding as well.


The plan for the next RFC series is to abstract out the hugetlb folio
management within guest_memfd so that any hugetlb-specific logic is
cleanly separated out, allowing guest_memfd to allocate memory from
other hugepage allocators in the future.

Yes, that must happen. As soon as a hugetlb folio transitions to guest_memfd, it must no longer be a hugetlb folio.



Simply starting with a separate global pool (e.g., boot-time allocation
similar to what hugetlb does, or CMA) might be cleaner, and a lot of
stuff could be factored out of hugetlb code to achieve that.
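A minimal sketch of what such a pool could look like if it were backed
by a boot-time CMA reservation; the pool name, size, and helpers are
made up, and the reservation would have to happen early in boot (much
like hugetlb_cma=), not from a regular initcall:

static struct cma *gmem_pool;

static int __init gmem_pool_reserve(void)
{
	/* Reserve 16 GiB somewhere, 1 GiB aligned; placement left to CMA. */
	return cma_declare_contiguous(0, 16UL * SZ_1G, 0, SZ_1G, 0, false,
				      "gmem_pool", &gmem_pool);
}

/*
 * Hand out a 1 GiB physically contiguous chunk from the pool.  Turning it
 * into a single high-order folio (as hugetlb does for gigantic pages)
 * would be additional work on top of this.
 */
static struct page *gmem_pool_alloc_1g(void)
{
	return cma_alloc(gmem_pool, SZ_1G / PAGE_SIZE, get_order(SZ_1G),
			 false);
}

static void gmem_pool_free_1g(struct page *page)
{
	cma_release(gmem_pool, page, SZ_1G / PAGE_SIZE);
}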

I am not sure a separate global pool necessarily solves all the
issues here unless we come up with more concrete implementation
details. One of the concerns was the ability to implement/retain
HVO while transferring memory between the separate global pool and
the hugetlb pool, i.e., whether it can seamlessly serve all hugepage users
on the host.

Likely should be doable. All we need is the generalized concept of a folio with HVO, and a way to move these folios between owners (e.g., global<->hugetlb, global<->guest_memfd).

Factoring the HVO logic out shouldn't be too crazy, I believe. Famous last words :)
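Purely as an illustration of that "generalized concept" (nothing below
exists today), the factored-out piece might end up as an ops/owner
interface along these lines:

/*
 * Entirely hypothetical: a high-order folio whose tail vmemmap has been
 * freed (HVO), independent of hugetlb, plus a way to hand it between
 * owners.
 */
struct hvo_folio_owner_ops {
	/* Apply the vmemmap optimization to a high-order folio. */
	int (*optimize)(struct folio *folio);
	/* Re-populate the vmemmap before splitting or normal use. */
	int (*restore)(struct folio *folio);
};

/* Move an HVO folio between pools, e.g. global -> guest_memfd, without
 * it ever carrying hugetlb state. */
int hvo_folio_transfer(struct folio *folio,
		       const struct hvo_folio_owner_ops *from,
		       const struct hvo_folio_owner_ops *to);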

Another question could be whether the separate
pool/allocator simplifies the split/merge operations at runtime.

The less hugetlb hacks we have to add, the better :)

--
Cheers,

David / dhildenb




