Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

David Hildenbrand <david@xxxxxxxxxx> · Fri, 21 Jun 2024 10:02:08 +0200

On 21.06.24 09:32, Quentin Perret wrote:
On Thursday 20 Jun 2024 at 20:18:14 (-0300), Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 03:47:23PM -0700, Elliot Berman wrote:
On Thu, Jun 20, 2024 at 11:29:56AM -0300, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:01:08PM +0200, David Hildenbrand wrote:
Regarding huge pages: assume the huge page (e.g., 1 GiB hugetlb) is shared,
now the VM requests to make one subpage private.

I think the general CC model has the shared/private setup earlier on
the VM lifecycle with large runs of contiguous pages. It would only
become a problem if you intend to to high rate fine granual
shared/private switching. Which is why I am asking what the actual
"why" is here.

I'd let Fuad comment if he's aware of any specific/concrete Anrdoid
usecases about converting between shared and private. One usecase I can
think about is host providing large multimedia blobs (e.g. video) to the
guest. Rather than using swiotlb, the CC guest can share pages back with
the host so host can copy the blob in, possibly using H/W accel. I
mention this example because we may not need to support shared/private
conversions at granularity finer than huge pages.

I suspect the more useful thing would be to be able to allocate actual
shared memory and use that to shuffle data without a copy, setup much
less frequently. Ie you could allocate a large shared buffer for video
sharing and stream the video frames through that memory without copy.

This is slightly different from converting arbitary memory in-place
into shared memory. The VM may be able to do a better job at
clustering the shared memory allocation requests, ie locate them all
within a 1GB region to further optimize the host side.

Jason, do you have scenario in mind? I couldn't tell if we now had a
usecase or are brainstorming a solution to have a solution.

No, I'm interested in what pKVM is doing that needs this to be so much
different than the CC case..

The underlying technology for implementing CC is obviously very
different (MMU-based for pKVM, encryption-based for the others + some
extra bits but let's keep it simple). In-place conversion is inherently
painful with encryption-based schemes, so it's not a surprise the
approach taken in these cases is built around destructive conversions as
a core construct. But as Elliot highlighted, the MMU-based approach
allows for pretty flexible and efficient zero-copy, which we're not
ready to sacrifice purely to shoehorn pKVM into a model that was
designed for a technology that has very different set of constraints.
A private->shared conversion in the pKVM case is nothing more than
setting a PTE in the recipient's stage-2 page-table.

I'm not at all against starting with something simple and bouncing via
swiotlb, that is totally fine. What is _not_ fine however would be to
bake into the userspace API that conversions are not in-place and
destructive (which in my mind equates to 'you can't mmap guest_memfd
pages'). But I think that isn't really a point of disagreement these
days, so hopefully we're aligned.

And to clarify some things I've also read in the thread, pKVM can
handle the vast majority of faults caused by accesses to protected
memory just fine. Userspace accesses protected guest memory? Fine,
we'll SEGV the userspace process. The kernel accesses via uaccess
macros? Also fine, we'll fail the syscall (or whatever it is we're
doing) cleanly -- the whole extable machinery works OK, which also
means that things like load_unaligned_zeropad() keep working as-is.
The only thing pKVM does is re-inject the fault back into the kernel
with some extra syndrome information it can figure out what to do by
itself.

It's really only accesses via e.g. the linear map that are problematic,
hence the exclusive GUP approach proposed in the series that tries to
avoid that by construction. That has the benefit of leaving
guest_memfd to other CC solutions that have more things in common. I
think it's good for that discussion to happen, no matter what we end up
doing in the end.

Thanks for the information. IMHO we really should try to find a common 
ground here, and FOLL_EXCLUSIVE is likely not it :)

Thanks for reviving this discussion with your patch set!

pKVM is interested in in-place conversion, I believe there are valid use 
cases for in-place conversion for TDX and friends as well (as discussed, 
I think that might be a clean way to get huge/gigantic page support in).

This implies the option to:

1) Have shared+private memory in guest_memfd
2) Be able to mmap shared parts
3) Be able to convert shared<->private in place

and later in my interest

4) Have huge/gigantic page support in guest_memfd with the option of
   converting individual subpages

We might not want to make use of that model for all of CC -- as you 
state, sometimes the destructive approach might be better performance 
wise -- but having that option doesn't sound crazy to me (and maybe 
would solve real issues as well).

After all, the common requirement here is that "private" pages are not 
mapped/pinned/accessible.

Sure, there might be cases like "pKVM can handle access to private pages 
in user page mappings", "AMD-SNP will not crash the host if writing to 
private pages" but there are not factors that really make a difference 
for a common solution.

private memory: not mapped, not pinned
shared memory: maybe mapped, maybe pinned
granularity of conversion: single pages

Anything I am missing?

--
Cheers,

David / dhildenb