Re: folio_mmapped

On 26.03.24 23:04, Elliot Berman wrote:
On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote:
On 22.03.24 18:52, David Hildenbrand wrote:
On 19.03.24 15:31, Will Deacon wrote:
Hi David,

Hi Will,

sorry for the late reply!


On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
On 19.03.24 01:10, Sean Christopherson wrote:
On Mon, Mar 18, 2024, Vishal Annapurve wrote:
On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
Second, we should find better ways to let an IOMMU map these pages,
*not* using GUP. There were already discussions on providing a similar
fd+offset-style interface instead. GUP really sounds like the wrong
approach here. Maybe we should look into passing not only guest_memfd,
but also "ordinary" memfds.

+1.  I am not completely opposed to letting SNP and TDX effectively convert
pages between private and shared, but I also completely agree that letting
anything gup() guest_memfd memory is likely to end in tears.

Yes. Avoid it right from the start, if possible.

People wanted guest_memfd to *not* have to mmap guest memory ("even for
ordinary VMs"). Now people are saying we have to be able to mmap it in order
to GUP it. It's getting tiring, really.

From the pKVM side, we're working on guest_memfd primarily to avoid
diverging from what other CoCo solutions end up using, but if it gets
de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
today with anonymous memory, then it's a really hard sell to switch over
from what we have in production. We're also hoping that, over time,
guest_memfd will become more closely integrated with the mm subsystem to
enable things like hypervisor-assisted page migration, which we would
love to have.

Reading Sean's reply, he has a different view on that. And I think
that's the main issue: there are too many different use cases and too
many different requirements that could turn guest_memfd into something
that maybe it really shouldn't be.


Today, we use the existing KVM interfaces (i.e. based on anonymous
memory) and it mostly works with the one significant exception that
accessing private memory via a GUP pin will crash the host kernel. If
all guest_memfd() can offer to solve that problem is preventing GUP
altogether, then I'd sooner just add that same restriction to what we
currently have instead of overhauling the user ABI in favour of
something which offers us very little in return.

On the mmap() side of things for guest_memfd, a simpler option for us
than what has currently been proposed might be to enforce that the VMM
has unmapped all private pages on vCPU run, failing the ioctl if that's
not the case. It needs a little more tracking in guest_memfd but I think
GUP will then fall out in the wash because only shared pages will be
mapped by userspace and so GUP will fail by construction for private
pages.

We're happy to pursue alternative approaches using anonymous memory if
you'd prefer to keep guest_memfd limited in functionality (e.g.
preventing GUP of private pages by extending mapping_flags as per [1]),
but we're equally willing to contribute to guest_memfd if extensions are
welcome.

What do you prefer?

Let me summarize the history:

AMD had its thing running and it worked for them (but I recall it was
hacky :) ).

TDX made it possible to crash the machine when accessing secure memory
from user space (MCE).

So secure memory must not be mapped into user space -- no page tables.
Prototypes with anonymous memory existed (and I didn't hate them,
although hacky), but one of the other selling points of guest_memfd was
that we could create VMs that wouldn't need any page tables at all,
which I found interesting.

There was a bit more to that (easier conversion, avoiding GUP,
specifying on allocation that the memory was unmovable ...), but I'll
get to that later.

The design principle was: nasty private memory (unmovable, unswappable,
inaccessible, un-GUPable) is allocated from guest_memfd; ordinary
"shared" memory is allocated from an ordinary memfd.

This makes sense: shared memory is neither nasty nor special. You can
migrate it, swap it out, map it into page tables, GUP it, ... without
any issues.


So if I were to describe some key characteristics of guest_memfd as of
today, they would probably be:

1) Memory is unmovable and unswappable. Right from the beginning, it is
      allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
2) Memory is inaccessible. It cannot be read from user space or the
      kernel, and it cannot be GUP'ed. Only some mechanisms (e.g.,
      hibernation, /proc/kcore) might end up touching it "by accident",
      and we can usually handle these cases.
3) Memory can be discarded at page granularity. There should be no case
      where memory cannot be discarded; otherwise we would over-allocate
      memory for private pages that have since been replaced by shared
      pages.
4) Page tables are not required (well, it's a memfd), and the fd could
      in theory be passed to other processes.

Having "ordinary shared" memory in there implies that 1) and 2) will
have to be adjusted for them, which kind-of turns it "partially" into
ordinary shmem again.


Going back to the beginning: with pKVM, we likely want the following

1) Convert pages private<->shared in-place
2) Stop user space + kernel from accessing private memory in process
      context. Likely for pKVM we would only crash the process, which
      would be acceptable.
3) Prevent GUP to private memory. Otherwise we could crash the kernel.
4) Prevent private pages from swapout+migration until supported.


I suspect your current solution with anonymous memory gets all but 3)
sorted out, correct?

I'm curious: might there be a requirement in the future that shared memory
could be mapped into other processes? (thinking vhost-user and such
things). Of course that's impossible with anonymous memory; teaching
shmem to contain private memory would kind-of lead to ... guest_memfd,
just that we don't have shared memory there.


I was just thinking of something stupid, not sure if it makes any sense.
I'll raise it here before I forget over the weekend.

... what if we glued one guest_memfd and a memfd (shmem) together in the
kernel somehow?

(1) A to-shared conversion moves a page from the guest_memfd to the memfd.

(2) A to-private conversion moves a page from the memfd to the guest_memfd.

Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd
behave like any other shmem pages: migratable, swappable, etc.


Of course, (2) is only possible if the page is not pinned, not mapped (we
can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA.
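
A minimal sketch of what direction (2) might look like inside the
kernel, assuming such a glued pair existed. All names below are
invented for illustration, locking is simplified, and migrating the
page off ZONE_MOVABLE / MIGRATE_CMA first is elided:

    /* Hypothetical: move a folio from the shared memfd into the
     * guest_memfd at the same offset. */
    static int gmem_convert_to_private(struct address_space *shared,
                                       struct address_space *priv,
                                       pgoff_t index)
    {
        struct folio *folio;
        int ret;

        folio = filemap_lock_folio(shared, index);
        if (IS_ERR(folio))
            return PTR_ERR(folio);

        /* Unmap from all user page tables. */
        if (folio_mapped(folio))
            unmap_mapping_folio(folio);

        /* Conversion is impossible while GUP pins exist. */
        if (folio_maybe_dma_pinned(folio)) {
            ret = -EBUSY;
            goto out;
        }

        /* Move the page between the two mappings. */
        filemap_remove_folio(folio);
        ret = filemap_add_folio(priv, folio, index, GFP_KERNEL);
    out:
        folio_unlock(folio);
        folio_put(folio);
        return ret;
    }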


Quentin gave an idea offline of using splice to achieve the conversions.
I'd want to use the in-kernel APIs on page-fault to do the conversion,
not requiring userspace to make the splice() syscall. One thing splice
currently requires is the source (in) file; the KVM UAPI today only
gives a userspace address. We could resolve that with
for_each_vma_range(). I've just started looking into splice(), but I
believe it takes care of the not-pinned and not-mapped requirements.
guest_memfd would have to migrate the page out of ZONE_MOVABLE /
MIGRATE_CMA.
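
For reference, the userspace splice(2) signature, which shows the
source-fd requirement mentioned above (this is the existing syscall,
nothing new):

    ssize_t splice(int fd_in, off64_t *off_in,
                   int fd_out, off64_t *off_out,
                   size_t len, unsigned int flags);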

I don't think we want to involve splice. Conceptually, I think KVM
should create a pair of FDs: guest_memfd for private memory and an
"ordinary shmem/memfd" for shared memory.

Conversion back and forth can either be triggered using a KVM API (TDX
use case) or internally from KVM (pKVM use case). Maybe there is
something splice does internally that we can reuse; otherwise we have
to do the plumbing ourselves.

Then, we have some logic on how to handle access to unbacked regions
(SIGBUS instead of allocating memory) inside both memfds, and we allow
memory to be allocated for parts of the fds explicitly.

No offset in the fds can be populated at the same time. That is, pages
can be moved back and forth, but allocating a fresh page in an fd is
only possible if there is nothing at that location in the other fd. No
memory over-allocation.

Coming up with a KVM API for that should be possible.
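
One conceivable shape for such an API (purely illustrative; the ioctl,
its number, and the struct below are invented to show the idea, not an
existing or proposed UAPI):

    /* Hypothetical conversion ioctl on the fd pair. */
    struct kvm_gmem_convert {
        __u64 offset;     /* page-aligned offset into both fds */
        __u64 size;       /* page-aligned length in bytes */
        __u32 to_shared;  /* 1: private -> shared, 0: shared -> private */
        __u32 pad;
    };

    #define KVM_GMEM_CONVERT _IOW(KVMIO, 0xd4, struct kvm_gmem_convert)

The conversion would fail with -EBUSY if the source page is still
mapped or pinned, which is what preserves the no-over-allocation
invariant described above.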


Does this seem like a good path to pursue further, or are there other
ideas for doing the conversion?

We'd have to decide what to do when we access a "hole" in the memfd --
instead of allocating a fresh page and filling the hole, we'd want to
SIGBUS.
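
As a sketch, the shared memfd's fault handler would only look up, never
allocate (hypothetical code; real shmem faults allocate into holes,
which is exactly the behavior we'd suppress):

    static vm_fault_t gmem_shared_fault(struct vm_fault *vmf)
    {
        struct inode *inode = file_inode(vmf->vma->vm_file);
        struct folio *folio;

        /* Look up an existing folio only; never fill a hole. */
        folio = filemap_lock_folio(inode->i_mapping, vmf->pgoff);
        if (IS_ERR(folio))
            return VM_FAULT_SIGBUS;

        vmf->page = folio_file_page(folio, vmf->pgoff);
        return VM_FAULT_LOCKED;
    }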

Since the KVM UAPI is based on userspace addresses and not fds for the
shared memory part, maybe we could add an mmu_notifier_ops callback
that allows KVM to intercept and reject faults if we couldn't reclaim
the memory. I think it would be conceptually similar to userfaultfd,
except in the kernel; I'm not sure if re-using userfaultfd makes sense?

Or, if KVM exposes this other fd as well, we extend the UAPI to consume
fd+offset for the shared part, too.
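
For the private side, KVM_SET_USER_MEMORY_REGION2 already consumes
fd+offset via struct kvm_userspace_memory_region2; a shared-side
equivalent could conceivably mirror the guest_memfd fields (any such
extension is speculative; the struct below is the existing UAPI):

    struct kvm_userspace_memory_region2 {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u64 userspace_addr;
        __u64 guest_memfd_offset;
        __u32 guest_memfd;
        __u32 pad1;
        __u64 pad2[14];
    };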

--
Cheers,

David / dhildenb




