Re: Re: folio_mmapped

Elliot Berman <quic_eberman@xxxxxxxxxxx> · Tue, 26 Mar 2024 15:04:14 -0700

On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote:
> On 22.03.24 18:52, David Hildenbrand wrote:
> > On 19.03.24 15:31, Will Deacon wrote:
> > > Hi David,
> > 
> > Hi Will,
> > 
> > sorry for the late reply!
> > 
> > > 
> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > Second, we should find better ways to let an IOMMU map these pages,
> > > > > > > *not* using GUP. There were already discussions on providing a similar
> > > > > > > fd+offset-style interface instead. GUP really sounds like the wrong
> > > > > > > approach here. Maybe we should look into passing not only guest_memfd,
> > > > > > > but also "ordinary" memfds.
> > > > > 
> > > > > +1.  I am not completely opposed to letting SNP and TDX effectively convert
> > > > > pages between private and shared, but I also completely agree that letting
> > > > > anything gup() guest_memfd memory is likely to end in tears.
> > > > 
> > > > Yes. Avoid it right from the start, if possible.
> > > > 
> > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > > > to GUP it. It's getting tiring, really.
> > > 
> > > From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > > today with anonymous memory, then it's a really hard sell to switch over
> > > from what we have in production. We're also hoping that, over time,
> > > guest_memfd will become more closely integrated with the mm subsystem to
> > > enable things like hypervisor-assisted page migration, which we would
> > > love to have.
> > 
> > Reading Sean's reply, he has a different view on that. And I think
> > that's the main issue: there are too many different use cases and too
> > many different requirements that could turn guest_memfd into something
> > that maybe it really shouldn't be.
> > 
> > > 
> > > Today, we use the existing KVM interfaces (i.e. based on anonymous
> > > memory) and it mostly works with the one significant exception that
> > > accessing private memory via a GUP pin will crash the host kernel. If
> > > all guest_memfd() can offer to solve that problem is preventing GUP
> > > altogether, then I'd sooner just add that same restriction to what we
> > > currently have instead of overhauling the user ABI in favour of
> > > something which offers us very little in return.
> > > 
> > > On the mmap() side of things for guest_memfd, a simpler option for us
> > > than what has currently been proposed might be to enforce that the VMM
> > > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > > not the case. It needs a little more tracking in guest_memfd but I think
> > > GUP will then fall out in the wash because only shared pages will be
> > > mapped by userspace and so GUP will fail by construction for private
> > > pages.
> > > 
> > > We're happy to pursue alternative approaches using anonymous memory if
> > > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > > preventing GUP of private pages by extending mapping_flags as per [1]),
> > > but we're equally willing to contribute to guest_memfd if extensions are
> > > welcome.
> > > 
> > > What do you prefer?
> > 
> > Let me summarize the history:
> > 
> > AMD had its thing running and it worked for them (but I recall it was
> > hacky :) ).
> > 
> > TDX made it possible to crash the machine when accessing secure memory
> > from user space (MCE).
> > 
> > So secure memory must not be mapped into user space -- no page tables.
> > Prototypes with anonymous memory existed (and I didn't hate them,
> > although hacky), but one of the other selling points of guest_memfd was
> > that we could create VMs that wouldn't need any page tables at all,
> > which I found interesting.
> > 
> > There was a bit more to that (easier conversion, avoiding GUP,
> > specifying on allocation that the memory was unmovable ...), but I'll
> > get to that later.
> > 
> > The design principle was: nasty private memory (unmovable, unswappable,
> > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
> > "shared" memory is allocated from an ordinary memfd.
> > 
> > This makes sense: shared memory is neither nasty nor special. You can
> > migrate it, swap it out, map it into page tables, GUP it, ... without
> > any issues.
> > 
> > 
> > So if I would describe some key characteristics of guest_memfd as of
> > today, it would probably be:
> > 
> > 1) Memory is unmovable and unswappable. Right from the beginning, it is
> >      allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> > 2) Memory is inaccessible. It cannot be read from user space or the
> >      kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
> >      hibernation, /proc/kcore) might end up touching it "by accident",
> >      and we usually can handle these cases.
> > 3) Memory can be discarded in page granularity. There should be no cases
> >      where you cannot discard memory to over-allocate memory for private
> >      pages that have been replaced by shared pages otherwise.
> > 4) Page tables are not required (well, it's a memfd), and the fd could
> >      in theory be passed to other processes.
> > 
> > Having "ordinary shared" memory in there implies that 1) and 2) will
> > have to be adjusted for them, which kind-of turns it "partially" into
> > ordinary shmem again.
> > 
> > 
> > Going back to the beginning: with pKVM, we likely want the following
> > 
> > 1) Convert pages private<->shared in-place
> > 2) Stop user space + kernel from accessing private memory in process
> >      context. Likely for pKVM we would only crash the process, which
> >      would be acceptable.
> > 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
> > 4) Prevent private pages from swapout+migration until supported.
> > 
> > 
> > I suspect your current solution with anonymous memory gets all but 3)
> > sorted out, correct?
> > 
> > I'm curious, may there be a requirement in the future that shared memory
> > could be mapped into other processes? (thinking vhost-user and such
> > things). Of course that's impossible with anonymous memory; teaching
> > shmem to contain private memory would kind-of lead to ... guest_memfd,
> > just that we don't have shared memory there.
> > 
> 
> I was just thinking of something stupid, not sure if it makes any sense.
> I'll raise it here before I forget over the weekend.
> 
> ... what if we glued one guest_memfd and a memfd (shmem) together in the
> kernel somehow?
> 
> (1) A to-shared conversion moves a page from the guest_memfd to the memfd.
> 
> (2) A to-private conversion moves a page from the memfd to the guest_memfd.
> 
> Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave
> like any shmem pages: migratable, swappable etc.
> 
> 
> Of course, (2) is only possible if the page is not pinned, not mapped (we
> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA.
> 

Quentin suggested offline using splice to achieve the conversions.
I'd want to use the in-kernel APIs on page fault to do the conversion,
rather than requiring userspace to make the splice() syscall. One thing
splice currently requires is the source (in) file; the KVM UAPI today
only gives a userspace address. We could resolve that with
for_each_vma_range(). I've just started looking into splice(), but I
believe it takes care of the "not pinned" and "not mapped" requirements.
guest_memfd would have to migrate the page out of ZONE_MOVABLE /
MIGRATE_CMA.

Does this seem like a good path to pursue further, or are there any
other ideas for doing the conversion?
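
To make the conversion rules above concrete, here is a toy userspace
model in C. It is only a sketch, not kernel code: every type and
function name (toy_page, toy_to_private, ...) is made up for
illustration, and the "migration" out of ZONE_MOVABLE / MIGRATE_CMA is
reduced to flipping a field. It assumes that a pinned page blocks a
to-private conversion, while user mappings can simply be torn down.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are kernel APIs. */
enum toy_owner { IN_MEMFD, IN_GUEST_MEMFD };     /* shared vs. private side */
enum toy_zone  { ZONE_OK, ZONE_MOVABLE_OR_CMA }; /* where the page lives */

struct toy_page {
    enum toy_owner owner;
    enum toy_zone zone;
    unsigned int pins;     /* GUP-style pins held by someone else */
    unsigned int mappings; /* user page-table mappings */
};

enum toy_conv { CONV_OK, CONV_BUSY_PINNED, CONV_WRONG_STATE };

/* (2) to-private: move the page from the memfd to the guest_memfd. */
enum toy_conv toy_to_private(struct toy_page *p)
{
    if (p->owner != IN_MEMFD)
        return CONV_WRONG_STATE;
    if (p->pins != 0)
        return CONV_BUSY_PINNED; /* a pin cannot be revoked, so refuse */
    p->mappings = 0;             /* mappings can be unmapped */
    p->zone = ZONE_OK;           /* stands in for migrating the page out */
    p->owner = IN_GUEST_MEMFD;
    return CONV_OK;
}

/* (1) to-shared: move the page from the guest_memfd to the memfd. */
enum toy_conv toy_to_shared(struct toy_page *p)
{
    if (p->owner != IN_GUEST_MEMFD)
        return CONV_WRONG_STATE;
    /* Model invariant: private pages are never mapped or pinned. */
    assert(p->pins == 0 && p->mappings == 0);
    p->owner = IN_MEMFD;
    return CONV_OK;
}

/* Only pages on the memfd side may be mmap'ed/read/written/GUP'ed. */
bool toy_can_map(const struct toy_page *p)
{
    return p->owner == IN_MEMFD;
}
```

In this model a pinned page keeps failing toy_to_private() until the
pin is dropped, which is the case where the kernel would have to report
an error or retry.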

> We'd have to decide what to do when we access a "hole" in the memfd --
> instead of allocating a fresh page and filling the hole, we'd want to
> SIGBUS.

Since the KVM UAPI is based on userspace addresses and not fds for the
shared memory part, maybe we could add an mmu_notifier_ops callback that
allows KVM to intercept and reject faults if we couldn't reclaim the
memory. I think it would be conceptually similar to userfaultfd, except
in the kernel; I'm not sure whether re-using userfaultfd makes sense.
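
As a rough illustration of the fault policy I have in mind, here is a
second toy model in C. Again this is a sketch under assumptions, not a
proposed interface: the callback type toy_reclaim_fn stands in for
whatever hook KVM would get, and all names are hypothetical. A fault on
a shared slot is mapped as usual; a fault on a hole (the page is
currently private) asks the callback to reclaim the page and raises
SIGBUS if that fails, instead of allocating a fresh page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
enum slot_state { SLOT_SHARED, SLOT_PRIVATE };
enum fault_result { FAULT_MAP_PAGE, FAULT_SIGBUS };

/* Hypothetical hook: returns true if the page was reclaimed from the guest. */
typedef bool (*toy_reclaim_fn)(size_t index);

/* Two trivial example hooks. */
bool toy_reclaim_ok(size_t index)      { (void)index; return true; }
bool toy_reclaim_refused(size_t index) { (void)index; return false; }

enum fault_result toy_handle_fault(enum slot_state *slots, size_t nslots,
                                   size_t index, toy_reclaim_fn try_reclaim)
{
    assert(slots != NULL && try_reclaim != NULL);

    if (index >= nslots)
        return FAULT_SIGBUS;     /* beyond the end of the file */
    if (slots[index] == SLOT_SHARED)
        return FAULT_MAP_PAGE;   /* ordinary shared page */

    /* Hole in the memfd: never fill it with a fresh page. */
    if (!try_reclaim(index))
        return FAULT_SIGBUS;     /* could not reclaim: reject the fault */

    slots[index] = SLOT_SHARED;  /* page is back on the shared side */
    return FAULT_MAP_PAGE;
}
```

The point of the callback is that the decision stays in the kernel,
which is the main difference from handing the fault to a userfaultfd
handler in userspace.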

Thanks,
Elliot