Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, May 05, 2023, Ackerley Tng wrote:
>>
>> Hi Sean,
>>
>> Thanks for implementing this POC!
>>
>> ... snip ...
>>
>
> I don't love either approach idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close(). I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell. And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> ... snip ...
>
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall. If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances. It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

I'd like to follow up on this discussion about a guest_mem file outliving
the VM, and about whether guest_mem creation should be a VM-scoped ioctl
or a system-scoped ioctl.

Here's a POC of delayed binding of a guest_mem file to a memslot, where
the guest_mem file outlives the VM [1]. I also hope to raise some points
before we do the first integration of the guest_mem patches!

A use case for guest_mem inodes outliving the VM is upgrading the host
VMM: the guest_mem inode is passed between two VMs on the same host
machine, and all memory associated with the inode needs to be retained.

To support this use case, binding of memslots is delayed until first use,
so that the following inode-passing flow can be used:

1. The source process (old version of the host VMM) passes guest_mem
   inodes to the destination process (new version of the host VMM) via
   unix sockets.

2. The destination process initializes memslots identical to the source
   process's.

3. The destination process invokes an ioctl to migrate the guest_mem
   inode over to the destination process by unbinding all memslots from
   the source VM and binding them to the destination VM. (The kvm
   pointer is updated in this step.)

Without delayed binding, step 2 would fail, since initialization of the
memslots would check and find that the kvm pointer in the guest_mem
inode points to the kvm in the source process.

These two patches contain the meat of the changes required to support
delayed binding:

https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df

Some things to highlight for the approach set out in these two patches:

1. Previously, closing the guest_mem file in userspace was taken to mean
   that all associated memory should be removed and cleared. With these
   two patches, each memslot also holds a reference to the file (and
   therefore the inode), so even if the host VMM closes the fd, the VM
   will be able to continue to function.

   This is desirable to userspace, since closing the file should not be
   interpreted as a command to clear memory.

   This is aligned with the way tmpfs files are used with KVM before
   guest_mem: when the file is closed in userspace, the memory contents
   are still mapped and can still be used by the VM. fallocate(PUNCH_HOLE)
   is how userspace should command memory to be removed, just as munmap()
   would be used to remove memory from use by KVM.

2. Creating a guest_mem file no longer depends on a specific VM, and
   hence the guest_mem creation ioctl can be a system ioctl instead of a
   VM-specific ioctl. This will also address Chao's concern at [3].

I also separated cleaning up files vs inodes in
https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
which I believe is more aligned with how files and inodes are cleaned up
in FSes. This alignment makes it easier to extend gmem to hugetlb, for
one.

While working on this, I was also wondering whether we should perhaps be
storing the inode pointer in slot->gmem instead of the file pointer. The
memory is associated with the inode->mapping rather than with the file.
Are we binding to a userspace handle on the inode (store the file
pointer), or are we really referencing the inode (store the inode
pointer)? A rough sketch of the bind path, illustrating both the
reference-holding in point 1 and this question, is appended below, after
the links.

The branch at [1] doesn't handle the bug Sean previously mentioned at
[2]: we need to take a reference on the KVM module, so that even if
guest_mem files are not bound to any VM, the KVM module cannot be
unloaded. If the KVM module can be unloaded while guest_mem files are
open, then userspace may be able to crash the kernel by invoking
guest_mem functions that were unloaded together with KVM.

[1] https://github.com/googleprodkernel/linux-cc/tree/gmem-delayed-binding
[2] https://lore.kernel.org/lkml/ZFWli2%2FH5M8MZRiY@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20230509124428.GA217130@xxxxxxxxxxxxxxxxxx/
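
To make point 1 and the file-vs-inode question a bit more concrete, here
is a rough sketch of how a bind path could pin the guest_mem file. The
kvm_gmem_bind()/kvm_gmem_unbind() names and the slot->gmem.file/pgoff
fields are placeholders of mine, not necessarily what the branch at [1]
actually does:

#include <linux/file.h>
#include <linux/kvm_host.h>

/*
 * Sketch only: the memslot takes its own reference on the guest_mem file
 * at bind time, so a later close() of the fd by the host VMM does not
 * tear down guest memory.  Memory is only released via
 * fallocate(PUNCH_HOLE) or when the memslot is deleted/unbound.
 */
static int kvm_gmem_bind(struct kvm_memory_slot *slot, unsigned int fd,
                         loff_t offset)
{
        struct file *file = fget(fd);   /* takes a reference on the file */

        if (!file)
                return -EBADF;

        /*
         * Storing the file pointer here; if we decide the binding should
         * reference the inode rather than the userspace handle, this
         * could instead hold file_inode(file).
         */
        slot->gmem.file = file;
        slot->gmem.pgoff = offset >> PAGE_SHIFT;
        return 0;
}

static void kvm_gmem_unbind(struct kvm_memory_slot *slot)
{
        /* Drop the memslot's reference when the slot is deleted/unbound. */
        if (slot->gmem.file) {
                fput(slot->gmem.file);
                slot->gmem.file = NULL;
        }
}

If we went with referencing the inode instead, slot->gmem would hold
file_inode(file) with its own ihold()/iput(), and the fd would really
just be a handle used at bind time.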