Re: folio_mmapped

David Hildenbrand <david@xxxxxxxxxx> · Fri, 22 Mar 2024 22:21:09 +0100

On 22.03.24 18:52, David Hildenbrand wrote:
On 19.03.24 15:31, Will Deacon wrote:
Hi David,

Hi Will,

sorry for the late reply!

On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
On 19.03.24 01:10, Sean Christopherson wrote:
On Mon, Mar 18, 2024, Vishal Annapurve wrote:
On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
Second, we should find better ways to let an IOMMU map these pages,
*not* using GUP. There were already discussions on providing a similar
fd+offset-style interface instead. GUP really sounds like the wrong
approach here. Maybe we should look into passing not only guest_memfd,
but also "ordinary" memfds.

+1.  I am not completely opposed to letting SNP and TDX effectively convert
pages between private and shared, but I also completely agree that letting
anything gup() guest_memfd memory is likely to end in tears.

Yes. Avoid it right from the start, if possible.

People wanted guest_memfd to *not* have to mmap guest memory ("even for
ordinary VMs"). Now people are saying we have to be able to mmap it in order
to GUP it. It's getting tiring, really.

From the pKVM side, we're working on guest_memfd primarily to avoid
diverging from what other CoCo solutions end up using, but if it gets
de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
today with anonymous memory, then it's a really hard sell to switch over
from what we have in production. We're also hoping that, over time,
guest_memfd will become more closely integrated with the mm subsystem to
enable things like hypervisor-assisted page migration, which we would
love to have.

Reading Sean's reply, he has a different view on that. And I think
that's the main issue: there are too many different use cases and too
many different requirements that could turn guest_memfd into something
that maybe it really shouldn't be.

Today, we use the existing KVM interfaces (i.e. based on anonymous
memory) and it mostly works with the one significant exception that
accessing private memory via a GUP pin will crash the host kernel. If
all guest_memfd can offer to solve that problem is preventing GUP
altogether, then I'd sooner just add that same restriction to what we
currently have instead of overhauling the user ABI in favour of
something which offers us very little in return.

On the mmap() side of things for guest_memfd, a simpler option for us
than what has currently been proposed might be to enforce that the VMM
has unmapped all private pages on vCPU run, failing the ioctl if that's
not the case. It needs a little more tracking in guest_memfd but I think
GUP will then fall out in the wash because only shared pages will be
mapped by userspace and so GUP will fail by construction for private
pages.
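
To make the "fail the ioctl on vCPU run" idea concrete, here is a toy
userspace model of the tracking it would need. This is only a sketch:
GuestMemfdModel, PageState, vcpu_run() and the choice of EBUSY are all
made up for illustration and do not correspond to any existing kernel
interface.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class PageState:
    private: bool = False    # True while the page belongs to the guest only
    user_mappings: int = 0   # how many times user space has it mapped


@dataclass
class GuestMemfdModel:
    """Hypothetical per-fd tracking; not a real kernel structure."""
    pages: dict = field(default_factory=dict)  # offset -> PageState

    def _page(self, offset):
        return self.pages.setdefault(offset, PageState())

    def mmap_page(self, offset):
        self._page(offset).user_mappings += 1

    def munmap_page(self, offset):
        page = self._page(offset)
        if page.user_mappings > 0:
            page.user_mappings -= 1

    def set_private(self, offset, private):
        self._page(offset).private = private

    def vcpu_run(self):
        """Return 0, or a negative errno if a private page is still mapped."""
        for page in self.pages.values():
            if page.private and page.user_mappings > 0:
                return -errno.EBUSY  # error code picked for illustration only
        return 0
```

With that invariant checked on every run, only shared pages can be
mapped while the guest executes, which is what would make GUP on
private pages fail by construction.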

We're happy to pursue alternative approaches using anonymous memory if
you'd prefer to keep guest_memfd limited in functionality (e.g.
preventing GUP of private pages by extending mapping_flags as per [1]),
but we're equally willing to contribute to guest_memfd if extensions are
welcome.

What do you prefer?

Let me summarize the history:

AMD had its thing running and it worked for them (but I recall it was
hacky :) ).

TDX made it possible to crash the machine when accessing secure memory
from user space (MCE).

So secure memory must not be mapped into user space -- no page tables.
Prototypes with anonymous memory existed (and I didn't hate them,
although hacky), but one of the other selling points of guest_memfd was
that we could create VMs that wouldn't need any page tables at all,
which I found interesting.

There was a bit more to that (easier conversion, avoiding GUP,
specifying on allocation that the memory was unmovable ...), but I'll
get to that later.

The design principle was: nasty private memory (unmovable, unswappable,
inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
"shared" memory is allocated from an ordinary memfd.

This makes sense: shared memory is neither nasty nor special. You can
migrate it, swap it out, map it into page tables, GUP it, ... without
any issues.

So if I were to describe some key characteristics of guest_memfd as of
today, it would probably be:

1) Memory is unmovable and unswappable. Right from the beginning, it is
     allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
2) Memory is inaccessible. It cannot be read from user space or the
     kernel, it cannot be GUP'ed ... only some mechanisms (e.g.,
     hibernation, /proc/kcore) might end up touching it "by accident",
     and we usually can handle these cases.
3) Memory can be discarded in page granularity. There should be no cases
     where you cannot discard memory; otherwise you would over-allocate
     memory for private pages that have been replaced by shared pages.
4) Page tables are not required (well, it's a memfd), and the fd could
     in theory be passed to other processes.

Having "ordinary shared" memory in there implies that 1) and 2) will
have to be adjusted for them, which kind-of turns it "partially" into
ordinary shmem again.

Going back to the beginning: with pKVM, we likely want the following:

1) Convert pages private<->shared in-place
2) Stop user space + kernel from accessing private memory in process
     context. Likely for pKVM we would only crash the process, which
     would be acceptable.
3) Prevent GUP to private memory. Otherwise we could crash the kernel.
4) Prevent private pages from swapout+migration until supported.

I suspect your current solution with anonymous memory gets all but 3)
sorted out, correct?

I'm curious, might there be a requirement in the future that shared memory
could be mapped into other processes? (thinking vhost-user and such
things). Of course that's impossible with anonymous memory; teaching
shmem to contain private memory would kind-of lead to ... guest_memfd,
just that we don't have shared memory there.

I was just thinking of something stupid, not sure if it makes any sense. 
I'll raise it here before I forget over the weekend.

... what if we glued one guest_memfd and a memfd (shmem) together in the 
kernel somehow?

(1) A to-shared conversion moves a page from the guest_memfd to the memfd.

(2) A to-private conversion moves a page from the memfd to the guest_memfd.

Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd 
behave like any shmem pages: migratable, swappable etc.

Of course, (2) is only possible if the page is not pinned, not mapped 
(we can unmap it). AND, the page must not reside on ZONE_MOVABLE / 
MIGRATE_CMA.

We'd have to decide what to do when we access a "hole" in the memfd -- 
instead of allocating a fresh page and filling the hole, we'd want to 
SIGBUS.
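
A rough userspace model of the rules above, just to pin down the
semantics. Everything here (GluedFds, PageModel, the exception names)
is invented for illustration, and refusing the conversion for
ZONE_MOVABLE / MIGRATE_CMA pages is a simplification of "must not
reside" there.

```python
from dataclasses import dataclass


class SigBus(Exception):
    """Stands in for the SIGBUS raised when touching a hole in the memfd."""


class ConversionRefused(Exception):
    """Raised when a to-private conversion is not possible."""


@dataclass
class PageModel:
    pins: int = 0               # outstanding GUP-style pins
    mapped: bool = False        # currently mapped into user space
    movable_zone: bool = False  # resides on ZONE_MOVABLE / MIGRATE_CMA


class GluedFds:
    """Hypothetical pairing of one guest_memfd with one memfd (shmem)."""

    def __init__(self):
        self.memfd = {}        # offset -> PageModel, shared pages
        self.guest_memfd = {}  # offset -> PageModel, private pages

    def populate_private(self, offset):
        self.guest_memfd[offset] = PageModel()

    def to_shared(self, offset):
        # (1) move the page from the guest_memfd to the memfd
        self.memfd[offset] = self.guest_memfd.pop(offset)

    def to_private(self, offset):
        # (2) move the page from the memfd to the guest_memfd
        page = self.memfd[offset]
        if page.pins > 0:
            raise ConversionRefused("page is pinned")
        if page.movable_zone:
            raise ConversionRefused("page is on ZONE_MOVABLE / MIGRATE_CMA")
        page.mapped = False  # mappings can simply be torn down
        self.guest_memfd[offset] = self.memfd.pop(offset)

    def access(self, offset):
        # Only the memfd side is accessible; a hole is not filled.
        if offset not in self.memfd:
            raise SigBus(offset)
        page = self.memfd[offset]
        page.mapped = True
        return page
```

The point of the model is that a page lives in exactly one of the two
fds at any time, so "is this offset accessible?" reduces to "is it
currently in the memfd?".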

--
Cheers,

David / dhildenb