Re: folio_mmapped

I had started a discussion for this [2] using an RFC series.

David is talking about the host side of things; AFAICT you're talking
about the guest side...

The challenges here remain:
1) Unifying all the conversions under one layer
2) Ensuring shared memory allocations are huge page aligned at boot
time and runtime.

Any kind of unified shared memory allocator (today this role is played
by SWIOTLB) will need to support huge-page-aligned dynamic increments,
which can only be guaranteed by carving out enough memory for a CMA
area at boot time and allocating from that CMA area at runtime.
     - Since it's hard to predict the maximum amount of shared memory
a VM will need, especially with GPUs/TPUs around, it's difficult to
pick a CMA area size at boot time.

...which is very relevant as carving out memory in the guest is nigh impossible,
but carving out memory in the host for systems whose sole purpose is to run VMs
is very doable.
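
For concreteness: on the host, such a carve-out is a single kernel
command-line parameter (the size below is made up):

    # host kernel command line: reserve a CMA area at boot; until
    # cma_alloc() claims it, it still serves movable allocations
    cma=4G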

I think it's arguable that even if a VM converts 10% of its memory to
shared at 4K granularity, we still have fewer page table walks on the
rest of the memory, which stays mapped with 1G/2M pages and remains a
significant portion.

Performance is a secondary concern.  If this were _just_ about guest performance,
I would unequivocally side with David: the guest gets to keep the pieces if it
fragments a 1GiB page.

The main problem we're trying to solve is that we want to provision a host such
that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously
run CoCo VMs, with 100% fungibility.  I.e. a host could run 100% non-CoCo VMs,
100% CoCo VMs, or more likely, some sliding mix of the two.  Ideally, CoCo VMs
would also get the benefits of 1GiB mappings, but that's not the driving
motivation for this discussion.

Supporting 1 GiB mappings there sounds like unnecessary complexity that
opens a big can of worms, especially if "it's not the driving motivation".

If I understand you correctly, the scenario is

(1) We have free 1 GiB hugetlb pages lying around
(2) We want to start a CoCo VM
(3) We don't care about 1 GiB mappings for that CoCo VM, but hugetlb
      pages are all we have.
(4) We want to be able to use the 1 GiB hugetlb page in the future.

With hugetlb, it's possible to reserve a CMA area from which to later
allocate 1 GiB pages. While not allocated, they can be used for movable
allocations.

So in the scenario above, free the hugetlb pages back to CMA. Then,
consume them as 4K pages for the CoCo VM. When wanting to start a
non-CoCo VM, re-allocate them from CMA.
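
A rough sketch of that flow, assuming the CMA area is reserved via the
hugetlb_cma= boot parameter (sizes made up):

    # boot: carve out a CMA area that 1 GiB hugetlb pages are
    # allocated from
    hugetlb_cma=2G

    # a non-CoCo VM wants 1 GiB pages: allocate them from that area
    echo 2 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

    # before starting a CoCo VM, free them back to CMA; the memory
    # then serves movable 4K allocations until re-allocated
    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages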

One catch with that is that
(a) CMA pages cannot get longterm-pinned: for obvious reasons, we
      wouldn't be able to migrate them in order to free up the 1 GiB page.
(b) guest_memfd pages are not movable and cannot currently end up on CMA
      memory.

But maybe that's not actually required in this scenario and we'd like to
have slightly different semantics: if you were to give the CoCo VM the 1
GiB pages, they would similarly be unusable until that VM quit and freed
up the memory!

So it might be acceptable to get "selected" unmovable allocations (from
guest_memfd) on selected (hugetlb) CMA areas, which the "owner" will
free up when wanting to re-allocate that memory. Otherwise, the CMA
allocation will simply fail.

If we need improvements in that area to support this case, we can talk.
Just an idea to avoid HGM and friends solely to make it somehow work
with 1 GiB pages ...
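
To illustrate the idea (purely hypothetical -- cma_alloc() and struct
cma are real, the registration hook below is not an existing
interface):

    #include <linux/cma.h>

    /*
     * Hypothetical sketch, not an existing kernel API: let a
     * "selected" owner (guest_memfd) place unmovable allocations on
     * a selected (hugetlb) CMA area. When cma_alloc() later wants
     * that range for a 1 GiB page, it asks the owner to free it via
     * @evict; if the owner cannot, the CMA allocation simply fails
     * instead of attempting to migrate unmovable pages.
     */
    int cma_register_unmovable_owner(struct cma *cma,
                                     int (*evict)(void *data,
                                                  unsigned long pfn,
                                                  unsigned long count),
                                     void *data);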


I thought about that some more, and some cases can also be tricky
(e.g., avoiding fragmenting multiple 1 GiB pages ...).

It's all tricky, especially once multiple (guest_)memfds are involved
and we really want to avoid most waste. Knowing that large mappings for
CoCo are rather "optional" and that the challenge is in "reusing" large
pages is valuable, though.

--
Cheers,

David / dhildenb




