On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote:
> On 22.03.24 18:52, David Hildenbrand wrote:
> > On 19.03.24 15:31, Will Deacon wrote:
> > > Hi David,
> > 
> > Hi Will,
> > 
> > sorry for the late reply!
> > 
> > > 
> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > > > > > Second, we should find better ways to let an IOMMU map these pages,
> > > > > > > *not* using GUP. There were already discussions on providing a similar
> > > > > > > fd+offset-style interface instead. GUP really sounds like the wrong
> > > > > > > approach here. Maybe we should look into passing not only guest_memfd,
> > > > > > > but also "ordinary" memfds.
> > > > > 
> > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert
> > > > > pages between private and shared, but I also completely agree that letting
> > > > > anything gup() guest_memfd memory is likely to end in tears.
> > > > 
> > > > Yes. Avoid it right from the start, if possible.
> > > > 
> > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for
> > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order
> > > > to GUP it. It's getting tiring, really.
> > > 
> > > From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
> > > today with anonymous memory, then it's a really hard sell to switch over
> > > from what we have in production. We're also hoping that, over time,
> > > guest_memfd will become more closely integrated with the mm subsystem to
> > > enable things like hypervisor-assisted page migration, which we would
> > > love to have.
> > 
> > Reading Sean's reply, he has a different view on that. And I think
> > that's the main issue: there are too many different use cases and too
> > many different requirements that could turn guest_memfd into something
> > that maybe it really shouldn't be.
> > 
> > > 
> > > Today, we use the existing KVM interfaces (i.e. based on anonymous
> > > memory) and it mostly works with the one significant exception that
> > > accessing private memory via a GUP pin will crash the host kernel. If
> > > all guest_memfd() can offer to solve that problem is preventing GUP
> > > altogether, then I'd sooner just add that same restriction to what we
> > > currently have instead of overhauling the user ABI in favour of
> > > something which offers us very little in return.
> > > 
> > > On the mmap() side of things for guest_memfd, a simpler option for us
> > > than what has currently been proposed might be to enforce that the VMM
> > > has unmapped all private pages on vCPU run, failing the ioctl if that's
> > > not the case. It needs a little more tracking in guest_memfd but I think
> > > GUP will then fall out in the wash because only shared pages will be
> > > mapped by userspace and so GUP will fail by construction for private
> > > pages.
> > > 
> > > We're happy to pursue alternative approaches using anonymous memory if
> > > you'd prefer to keep guest_memfd limited in functionality (e.g.
> > > preventing GUP of private pages by extending mapping_flags as per [1]),
> > > but we're equally willing to contribute to guest_memfd if extensions are
> > > welcome.
> > > 
> > > What do you prefer?
> > 
> > Let me summarize the history:
> > 
> > AMD had its thing running and it worked for them (but I recall it was
> > hacky :) ).
> > 
> > TDX made it possible to crash the machine when accessing secure memory
> > from user space (MCE).
> > 
> > So secure memory must not be mapped into user space -- no page tables.
> > Prototypes with anonymous memory existed (and I didn't hate them,
> > although hacky), but one of the other selling points of guest_memfd was
> > that we could create VMs that wouldn't need any page tables at all,
> > which I found interesting.
> > 
> > There was a bit more to that (easier conversion, avoiding GUP,
> > specifying on allocation that the memory was unmovable ...), but I'll
> > get to that later.
> > 
> > The design principle was: nasty private memory (unmovable, unswappable,
> > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary
> > "shared" memory is allocated from an ordinary memfd.
> > 
> > This makes sense: shared memory is neither nasty nor special. You can
> > migrate it, swap it out, map it into page tables, GUP it, ... without
> > any issues.
> > 
> > 
> > So if I would describe some key characteristics of guest_memfd as of
> > today, it would probably be:
> > 
> > 1) Memory is unmovable and unswappable. Right from the beginning, it is
> >    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> > 2) Memory is inaccessible. It cannot be read from user space, the
> >    kernel, it cannot be GUP'ed ... only some mechanisms might end up
> >    touching that memory (e.g., hibernation, /proc/kcore) might end up
> >    touching it "by accident", and we usually can handle these cases.
> > 3) Memory can be discarded in page granularity. There should be no cases
> >    where you cannot discard memory to over-allocate memory for private
> >    pages that have been replaced by shared pages otherwise.
> > 4) Page tables are not required (well, it's an memfd), and the fd could
> >    in theory be passed to other processes.
> > 
> > Having "ordinary shared" memory in there implies that 1) and 2) will
> > have to be adjusted for them, which kind-of turns it "partially" into
> > ordinary shmem again.
> > 
> > 
> > Going back to the beginning: with pKVM, we likely want the following
> > 
> > 1) Convert pages private<->shared in-place
> > 2) Stop user space + kernel from accessing private memory in process
> >    context. Likely for pKVM we would only crash the process, which
> >    would be acceptable.
> > 3) Prevent GUP to private memory. Otherwise we could crash the kernel.
> > 4) Prevent private pages from swapout+migration until supported.
> > 
> > 
> > I suspect your current solution with anonymous memory gets all but 3)
> > sorted out, correct?
> > 
> > I'm curious, may there be a requirement in the future that shared memory
> > could be mapped into other processes? (thinking vhost-user and such
> > things). Of course that's impossible with anonymous memory; teaching
> > shmem to contain private memory would kind-of lead to ... guest_memfd,
> > just that we don't have shared memory there.
> 
> I was just thinking of something stupid, not sure if it makes any sense.
> I'll raise it here before I forget over the weekend.
> 
> ... what if we glued one guest_memfd and a memfd (shmem) together in the
> kernel somehow?
> 
> (1) A to-shared conversion moves a page from the guest_memfd to the memfd.
> 
> (2) A to-private conversion moves a page from the memfd to the guest_memfd.
> 
> Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave
> like any shmem pages: migratable, swappable etc.
> 
> Of course, (2) is only possible if the page is not pinned, not mapped (we
> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA.

Quentin gave the idea offline of using splice to achieve the conversions.
I'd want to use the in-kernel APIs on page fault to do the conversion,
rather than requiring userspace to make the splice() syscall. One thing
splice currently requires is the source (in) file; the KVM UAPI today
only gives us a userspace address. We could resolve that with
for_each_vma_range().
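Roughly what I have in mind for that lookup (untested sketch;
gmem_find_shared_file() is a made-up name, and it assumes the whole
range is backed by a single shmem file):

static struct file *gmem_find_shared_file(struct mm_struct *mm,
                                          unsigned long addr,
                                          unsigned long end)
{
        struct vm_area_struct *vma;
        struct file *file = NULL;
        VMA_ITERATOR(vmi, mm, addr);

        mmap_read_lock(mm);
        for_each_vma_range(vmi, vma, end) {
                /*
                 * Only accept a range that is entirely backed by one
                 * shmem file; holes in the range are ignored here for
                 * brevity.
                 */
                if (!vma->vm_file ||
                    !shmem_mapping(vma->vm_file->f_mapping) ||
                    (file && file != vma->vm_file)) {
                        file = NULL;
                        break;
                }
                file = vma->vm_file;
        }
        if (file)
                get_file(file);
        mmap_read_unlock(mm);

        return file;    /* caller fput()s it when done */
}

KVM would call something like this from the conversion path with the
userspace address it already has, and pass the returned file as the
splice source.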
I've just started looking into splice(), but I believe it takes care of
the "not pinned" and "not mapped" requirements. guest_memfd would have
to migrate the page out of ZONE_MOVABLE / MIGRATE_CMA.

Does this seem like a good path to pursue further, or are there other
ideas for doing the conversion?

> We'd have to decide what to do when we access a "hole" in the memfd --
> instead of allocating a fresh page and filling the hole, we'd want to
> SIGBUS.

Since the KVM UAPI is based on userspace addresses and not fds for the
shared memory part, maybe we could add an mmu_notifier_ops callback that
allows KVM to intercept and reject faults if we couldn't reclaim the
memory. I think it would be conceptually similar to userfaultfd, except
in the kernel; I'm not sure whether re-using userfaultfd makes sense.
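To illustrate the shape of it, the hook could be a new mmu_notifier_ops
member along these lines (completely made up, nothing like this exists
in include/linux/mmu_notifier.h today):

        /*
         * Called before a missing page is faulted in at @address.
         * Return 0 to let the fault proceed, or a negative errno to
         * reject it; the fault handler would report that as SIGBUS
         * instead of filling the hole with a fresh page.
         */
        int (*handle_shared_fault)(struct mmu_notifier *subscription,
                                   struct mm_struct *mm,
                                   unsigned long address);

KVM's implementation would fail the fault whenever the corresponding
page is still private in guest_memfd and can't be reclaimed.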
Thanks,
Elliot