Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

David Hildenbrand <david@xxxxxxxxxx> · Thu, 20 Jun 2024 16:45:08 +0200

On 20.06.24 16:29, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
On 20.06.24 15:55, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 09:32:11AM +0100, Fuad Tabba wrote:
Hi,

On Thu, Jun 20, 2024 at 5:11 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

On Wed, Jun 19, 2024 at 08:51:35AM -0300, Jason Gunthorpe wrote:
If you can't agree with the guest_memfd people on how to get there
then maybe you need a guest_memfd2 for this slightly different special
stuff instead of intruding on the core mm so much. (though that would
be sad)

Or we're just not going to support it at all.  It's not like supporting
this weird usage model is a must-have for Linux to start with.

Sorry, but could you please clarify to me what usage model you're
referring to exactly, and why you think it's weird? It's just that we
have covered a few things in this thread, and to me it's not clear if
you're referring to protected VMs sharing memory, or being able to
(conditionally) map a VM's memory that's backed by guest_memfd(), or
if it's the Exclusive pin.

Personally I think mapping memory under guest_memfd is pretty weird.

I don't really understand why you end up with something different than
normal CC. Normal CC has memory that the VMM can access and memory it
cannot access. guest_memfd is supposed to hold the memory the VMM cannot
reach, right?

So how does normal CC handle memory switching between private and
shared and why doesn't that work for pKVM? I think the normal CC path
effectively discards the memory content on these switches and is
slow. Are you trying to make the switch content preserving and faster?

If yes, why? What is wrong with the normal CC model of slow and
non-preserving shared memory?

I'll leave the !huge page part to Fuad.

Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
now the VM requests to make one subpage private.

I think the general CC model has the shared/private setup earlier on
the VM lifecycle with large runs of contiguous pages. It would only
become a problem if you intend to do high-rate, fine-grained
shared/private switching. Which is why I am asking what the actual
"why" is here.

I am not an expert on that, but I remember that the way memory 
shared<->private conversion happens can heavily depend on the VM use 
case, and that under pKVM we might see more frequent conversion, without 
even going to user space.

How to handle that without eventually running into a double
memory-allocation? (in the worst case, allocating a 1GiB huge page
for shared and for private memory).

I expect you'd take the linear range of 1G of PFNs and fragment it
into three ranges private/shared/private that span the same 1G.

When you construct a page table (i.e., an S2) that holds these three
ranges and has permission to access all the memory, you want the page
table to automatically join them back together into a 1GB entry.

When you construct a page table that has only access to the shared,
then you'd only install the shared hole at its natural best size.
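
To make the join/split behaviour concrete, here is a rough sketch and not
kernel code. It assumes 4 KiB base pages with 2 MiB and 1 GiB block sizes,
and it only looks at physical alignment (a real S2 would also need the
guest address to be suitably aligned). The names map_blocks and LEVELS are
made up for illustration.

```python
# Block sizes in units of 4 KiB pages: 1 GiB, 2 MiB, 4 KiB (assumed layout).
LEVELS = (512 * 512, 512, 1)

def map_blocks(ranges, allowed):
    """ranges: (start_pfn, npages, state) tuples sorted in physical order.
    allowed: set of states this page table may access.
    Returns the (start_pfn, npages) block mappings to install."""
    # Join physically adjacent ranges that this page table may access.
    runs = []
    for start, npages, state in ranges:
        if state not in allowed:
            continue
        if runs and runs[-1][0] + runs[-1][1] == start:
            runs[-1] = (runs[-1][0], runs[-1][1] + npages)
        else:
            runs.append((start, npages))
    # Cover each run with the largest naturally aligned blocks that fit.
    blocks = []
    for start, npages in runs:
        pfn, end = start, start + npages
        while pfn < end:
            size = next(s for s in LEVELS if pfn % s == 0 and pfn + s <= end)
            blocks.append((pfn, size))
            pfn += size
    return blocks

base = 512 * 512  # a 1 GiB aligned PFN, chosen arbitrarily
ranges = [
    (base, 1024, "private"),
    (base + 1024, 512, "shared"),
    (base + 1536, 512 * 512 - 1536, "private"),
]
print(map_blocks(ranges, {"private", "shared"}))  # one 1 GiB block
print(map_blocks(ranges, {"shared"}))             # one 2 MiB block
```

With access to everything, the three ranges collapse into a single 1 GiB
entry; with access to the shared range only, the 2 MiB hole is installed at
its natural size.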

So, I think there are two challenges - how to build an allocator and
uAPI to manage this sort of stuff so you can keep track of any
fractured pfns and ensure things remain in physical order.

Then how to re-consolidate this for the KVM side of the world.

Exactly!

guest_memfd, or something like it, is just really a good answer. You
have it obtain the huge folio, and keep track on its own which sub
pages can be mapped to a VMA because they are shared. KVM will obtain
the PFNs directly from the fd and KVM will not see the shared
holes. This means your S2's can be trivially constructed correctly.

No need to double allocate..
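
A toy model of that tracking, purely as a sketch: the GuestFolio class and
its methods are hypothetical and do not correspond to any existing
guest_memfd interface. The point is only that one allocation backs both
views, with a per-sub-page flag deciding what a VMA may map.

```python
class GuestFolio:
    """Hypothetical model of one huge folio owned by a guest_memfd-like fd."""

    def __init__(self, base_pfn, npages):
        self.base_pfn = base_pfn
        self.npages = npages
        self.shared = bytearray(npages)  # one flag per sub page, 0 = private

    def _check(self, index):
        if not 0 <= index < self.npages:
            raise IndexError("sub page index out of range")

    def set_shared(self, index, is_shared):
        self._check(index)
        self.shared[index] = 1 if is_shared else 0

    def kvm_pfn(self, index):
        # KVM takes PFNs straight from the fd, whatever the sharing state,
        # so it sees one physically contiguous folio.
        self._check(index)
        return self.base_pfn + index

    def user_pfn(self, index):
        # Only shared sub pages may be mapped into a VMA.
        self._check(index)
        if not self.shared[index]:
            raise PermissionError("sub page is private")
        return self.base_pfn + index
```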

Yes, that's why my thinking so far was:

Let guest_memfd (or something like that) consume huge pages (somehow, 
let it access the hugetlb reserves). Preallocate that memory once, as 
the VM starts up: just like we do with hugetlb in VMs.

Let KVM track which parts are shared/private, and if required, let it 
map only the shared parts to user space. KVM has all information to make 
these decisions.

If we could disallow pinning any shared pages, that would make life a 
lot easier, but I think there were reasons for why we might require it. 
To convert shared->private, simply unmap that folio (only the shared 
parts could possibly be mapped) from all user page tables.
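
The shared->private step could look roughly like the sketch below. It is a
single-threaded toy: user page tables are modelled as sets of sub page
indices, and refusing the conversion while a pin exists is only an
assumption for illustration, since how pins should be handled is exactly
the open question. convert_to_private is a made-up name.

```python
def convert_to_private(index, shared, page_tables, pin_counts):
    """Try to convert one sub page from shared to private.

    shared:      list of bools, True where the sub page is shared
    page_tables: list of sets, each holding the indices one user
                 page table currently maps
    pin_counts:  dict mapping index -> number of outstanding pins
    Returns True if the sub page is private afterwards."""
    if not shared[index]:
        return True  # already private, nothing can be mapped
    # Unmap from all user page tables first.
    for table in page_tables:
        table.discard(index)
    # Assumption: an outstanding pin blocks the conversion; the caller
    # would have to retry or fail the request.
    if pin_counts.get(index, 0) > 0:
        return False
    shared[index] = False
    return True
```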

Of course, there might be alternatives, and I'll be happy to learn about
them. The allocator part would be fairly easy, and the uAPI part would
be comparably easy. So far the theory :)

I'm kind of surprised the CC folks don't want the same thing for
exactly the same reason. It is much easier to recover the huge
mappings for the S2 in the presence of shared holes if you track it
this way. Even CC will have this problem, to some degree, too.

Precisely! RH (and therefore, me) is primarily interested in existing 
guest_memfd users at this point ("CC"), and I don't see an easy way to 
get that running with huge pages in the existing model reasonably well ...

--
Cheers,

David / dhildenb