Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, May 05, 2023, Ackerley Tng wrote:
>>
>> Hi Sean,
>>
>> Thanks for implementing this POC!
>>
>> ... snip ...
>>
>
> I don't love either approach idea because it means a file created in the context
> of a VM can outlive the VM itself, and then userspace ends up with a file descriptor
> that it can't do anything with except close(). I doubt that matters in practice
> though, e.g. when the VM dies, all memory can be freed so that the file ends up
> being little more than a shell. And if we go that route, there's no need to grab
> a reference to the file during bind, KVM can just grab a longterm reference when
> the file is initially created and then drop it when KVM dies (and nullifies gmem->kvm).
>
> ... snip ...
>
> My preference is to make it a VM-scoped ioctl(), if it ends up being a KVM ioctl()
> and not a common syscall. If the file isn't tightly coupled to a single VM, then
> punching a hole is further complicated by needing to deal with invalidating multiple
> regions that are bound to different @kvm instances. It's not super complex, but
> AFAICT having the ioctl() be system-scoped doesn't add value, e.g. I don't think
> having one VM own the memory will complicate even if/when we get to the point where
> VMs can share "private" memory, and the gmem code would still need to deal with
> grabbing a module reference.

I'd like to follow up on this discussion about a guest_mem file outliving
the VM, and about whether guest_mem creation should be a VM-scoped ioctl
or a system-scoped ioctl.

Here's a POC of delayed binding of a guest_mem file to a memslot, where
the guest_mem file outlives the VM [1]. I also hope to raise some points
before we do the first integration of the guest_mem patches!

A use case for guest_mem inodes outliving the VM is upgrading the host
VMM: the guest_mem inode is passed between two VMs on the same host
machine, and all memory associated with the inode needs to be retained.

To support this use case, binding of memslots is delayed until first use,
so that the following inode-passing flow can be used:

1. The source process (old version of the host VMM) passes guest_mem
   inodes to the destination process (new version of the host VMM) via
   unix sockets.

2. The destination process initializes memslots identical to the source
   process's.

3. The destination process invokes an ioctl to migrate the guest_mem
   inode over to the destination process by unbinding all memslots from
   the source VM and binding them to the destination VM. (The kvm
   pointer is updated in this step.)

Without delayed binding, step 2 would fail, since initialization of the
memslots would check and find that the kvm pointer in the guest_mem
inode points to the kvm in the source process.

These two patches contain the meat of the changes required to support
delayed binding:

https://github.com/googleprodkernel/linux-cc/commit/93b31a006ef2e4dbe1ef0ec5d2534ca30f3bf60c
https://github.com/googleprodkernel/linux-cc/commit/dd5ac5e53f14a1ef9915c9c1e4cc1006a40b49df

Some things to highlight for the approach set out in these two patches:

1. Previously, closing the guest_mem file in userspace was taken to mean
   that all associated memory should be removed and cleared. With these
   two patches, each memslot also holds a reference to the file (and
   therefore the inode), so even if the host VMM closes the fd, the VM
   will be able to continue to function.

   This is desirable to userspace, since closing the file should not be
   interpreted as a command to clear memory.

   This is aligned with the way tmpfs files are used with KVM before
   guest_mem: when the file is closed in userspace, the memory contents
   are still mapped and can still be used by the VM. fallocate(PUNCH_HOLE)
   is how userspace should command memory to be removed, just as munmap()
   would be used to remove memory from use by KVM.

2. Creating a guest_mem file no longer depends on a specific VM, and
   hence the guest_mem creation ioctl can be a system ioctl instead of a
   VM-specific ioctl. This will also address Chao's concern at [3].

I also separated cleaning up files vs inodes in
https://github.com/googleprodkernel/linux-cc/commit/0f5aa18910c515141e57e05c4cc791022047a242,
which I believe is more aligned with how files and inodes are cleaned up
in FSes. This alignment makes it easier to extend gmem to hugetlb, for
one.

While working on this, I was also wondering whether we should perhaps be
storing the inode pointer in slot->gmem instead of the file pointer. The
memory is associated with the inode->mapping rather than with the file.
Are we binding to a userspace handle on the inode (store the file
pointer), or are we really referencing the inode (store the inode
pointer)? A rough sketch of the bind path, illustrating both the
reference-holding in point 1 and this question, is appended below, after
the links.

The branch at [1] doesn't handle the bug Sean previously mentioned at
[2]: we need to take a reference on the KVM module, so that even if
guest_mem files are not bound to any VM, the KVM module cannot be
unloaded. If the KVM module can be unloaded while guest_mem files are
open, then userspace may be able to crash the kernel by invoking
guest_mem functions that were unloaded together with KVM.

[1] https://github.com/googleprodkernel/linux-cc/tree/gmem-delayed-binding
[2] https://lore.kernel.org/lkml/ZFWli2%2FH5M8MZRiY@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/20230509124428.GA217130@xxxxxxxxxxxxxxxxxx/
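
To make point 1 and the file-vs-inode question a bit more concrete, here
is a rough sketch of how a bind path could pin the guest_mem file. The
kvm_gmem_bind()/kvm_gmem_unbind() names and the slot->gmem.file/pgoff
fields are placeholders of mine, not necessarily what the branch at [1]
actually does:

#include <linux/file.h>
#include <linux/kvm_host.h>

/*
 * Sketch only: the memslot takes its own reference on the guest_mem file
 * at bind time, so a later close() of the fd by the host VMM does not
 * tear down guest memory.  Memory is only released via
 * fallocate(PUNCH_HOLE) or when the memslot is deleted/unbound.
 */
static int kvm_gmem_bind(struct kvm_memory_slot *slot, unsigned int fd,
                         loff_t offset)
{
        struct file *file = fget(fd);   /* takes a reference on the file */

        if (!file)
                return -EBADF;

        /*
         * Storing the file pointer here; if we decide the binding should
         * reference the inode rather than the userspace handle, this
         * could instead hold file_inode(file).
         */
        slot->gmem.file = file;
        slot->gmem.pgoff = offset >> PAGE_SHIFT;
        return 0;
}

static void kvm_gmem_unbind(struct kvm_memory_slot *slot)
{
        /* Drop the memslot's reference when the slot is deleted/unbound. */
        if (slot->gmem.file) {
                fput(slot->gmem.file);
                slot->gmem.file = NULL;
        }
}

If we went with referencing the inode instead, slot->gmem would hold
file_inode(file) with its own ihold()/iput(), and the fd would really
just be a handle used at bind time.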