On Fri, Jul 14, 2023, Vishal Annapurve wrote:
> On Fri, Jul 14, 2023 at 12:29 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > ...
> > And _if_ there is a VMM that instantiates memory before KVM_CREATE_VM, IMO making
> > the ioctl() /dev/kvm scoped would have no meaningful impact on adapting userspace
> > to play nice with the required ordering. If userspace can get at /dev/kvm, then
> > it can do KVM_CREATE_VM, because the only input to KVM_CREATE_VM is the type, i.e.
> > the only dependencies for KVM_CREATE_VM should be known/resolved long before the
> > VMM knows it wants to use gmem.
>
> I am not sure about the benefits of tying gmem creation to any given
> kvm instance.

IMO, making gmem->kvm immutable is very nice to have, e.g. gmem->kvm will always
be valid and the refcounting rules are fairly straightforward.

> I think the most important requirement here is that a given gmem range is always
> tied to a single VM

I'm not convinced that that requirement will always hold true (see below).

> This can be enforced when memslots are bound to the gmem files.

Yeah, but TBH, waiting until the guest faults in memory to detect an invalid
memslot is gross. And looking more closely, taking filemap_invalidate_lock(),
i.e. taking a semaphore for write, in the page fault path is a complete
non-starter.

The "if (existing_slot == slot)" check is likely a non-starter, because KVM
handles FLAGS_ONLY memslot updates, e.g. toggling dirty logging, by duplicating
and replacing the memslot, not by updating the live memslot.

> I believe "Required ordering" is that gmem files are created first and
> then supplied while creating the memslots whose gpa ranges can
> generate private memory accesses.
> Is there any other ordering we want to enforce here?

I wasn't talking about enforcing arbitrary ordering, I was simply talking about
what userspace literally needs to be able to do KVM_CREATE_GUEST_MEMFD.

> > Practically, I think that gives us a clean, intuitive way to handle intra-host
> > migration. Rather than transfer ownership of the file, instantiate a new file
> > for the target VM, using the gmem inode from the source VM, i.e. create a hard
> > link. That'd probably require new uAPI, but I don't think that will be hugely
> > problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped
> > until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> > memslots/bindings are identical), but that should be easy enough to enforce.
> >
> > That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing
> > the memory and the *contents* of memory to outlive the VM, i.e. be effectively
> > transferred to the new target VM. And we'll maintain the invariant that each
> > guest_memfd is bound 1:1 with a single VM.
> >
> > As above, that should also help us draw the line between mapping memory into a
> > VM (file), and freeing/reclaiming the memory (inode).
> >
> > There will be extra complexity/overhead as we'll have to play nice with the
> > possibility of multiple files per inode, e.g. to zap mappings across all files
> > when punching a hole, but the extra complexity is quite small, e.g. we can use
> > address_space.private_list to keep track of the guest_memfd instances associated
> > with the inode.
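
To be a bit more concrete about the "multiple files per inode" complexity, the
hole punch side could look something like this (completely untested, and the
struct layout and helper names below are purely illustrative):

/*
 * Hypothetical per-file state, names invented purely for this sketch.  The
 * list_head would be hung off of inode->i_mapping->private_list so that a
 * hole punch can find every file, i.e. every VM, bound to the inode.
 */
struct kvm_gmem {
        struct kvm *kvm;
        struct list_head entry;
};

static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
        struct address_space *mapping = inode->i_mapping;
        struct kvm_gmem *gmem;

        /*
         * Taking the invalidate lock for write blocks page faults; in this
         * sketch it also protects the private_list walk.
         */
        filemap_invalidate_lock(mapping);

        /* Zap SPTEs in every VM bound to the inode (made-up helper). */
        list_for_each_entry(gmem, &mapping->private_list, entry)
                kvm_gmem_invalidate_range(gmem, offset, len);

        truncate_inode_pages_range(mapping, offset, offset + len - 1);

        filemap_invalidate_unlock(mapping);
        return 0;
}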

> Are we talking about a different usecase of sharing gmem fd across VMs
> other than intra-host migration?

Well, I am :-) I don't want to build all of this on an assumption that we'll
never ever want to share a guest_memfd across multiple VMs. E.g. SEV (and
SEV-ES?) already have the migration helper concept, and I've heard more than a
few rumblings of TDX utilizing helper TDs. IMO, it's not far-fetched at all to
think that there will eventually be a need to let multiple VMs share a
guest_memfd.

> If not, ideally only one of the files should be catering to the guest
> memory mappings at any given time. i.e. any inode should be ideally
> bound to (through the file) a single kvm instance,

Why? Honest question, what does it buy us?

For TDX and SNP intra-host migration, it should be easy enough to ensure the
new VM can't create mappings before migration, and that the old VM can't create
mappings or run after migration. I don't see that being any harder if the
source and dest use different files.

FWIW, it might be easier to hold off on this discussion until I post the RFC
(which is going to happen on Monday at this point), as then we'll have actual
code to discuss.

> as we are planning to ensure that guest_memfd can't be mapped until
> KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM is invoked on the target side.
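
That constraint doesn't require a 1:1 inode:VM binding though; per-file state
gets the job done. Building on the (again completely untested, purely
illustrative) sketch above:

struct kvm_gmem {
        struct kvm *kvm;
        struct list_head entry;
        bool mappable;  /* false until this file may fault in memory */
};

/* A guest_memfd created as a migration target would start unmappable... */
static bool kvm_gmem_fault_allowed(struct kvm_gmem *gmem)
{
        return gmem->mappable;
}

/* ...and KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would do the handoff. */
static int kvm_gmem_move_enc_context(struct kvm_gmem *dst,
                                     struct kvm_gmem *src)
{
        /* Verify the memslots/bindings are identical (made-up helper). */
        if (!kvm_gmem_bindings_match(dst, src))
                return -EINVAL;

        src->mappable = false;  /* the old VM can no longer create mappings */
        dst->mappable = true;   /* the new VM takes over */
        return 0;
}

i.e. the source and dest files can happily coexist on the same inode so long as
only one of them is "live" at any given time.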