Hi Ackerley,

On Wed, 12 Feb 2025 at 05:07, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
>
> Fuad Tabba <tabba@xxxxxxxxxx> writes:
>
> > Add support for mmap() and fault() for guest_memfd backed memory
> > in the host for VMs that support in-place conversion between
> > shared and private (shared memory). To that end, this patch adds
> > the ability to check whether the VM type has that support, and
> > only allows mapping its memory if that's the case.
> >
> > Additionally, this behavior is gated with a new configuration
> > option, CONFIG_KVM_GMEM_SHARED_MEM.
> >
> > Signed-off-by: Fuad Tabba <tabba@xxxxxxxxxx>
> >
> > ---
> >
> > This patch series will allow shared memory support for software
> > VMs in x86. It will also introduce a similar VM type for arm64
> > and allow shared memory support for that. In the future, pKVM
> > will also support shared memory.
>
> Thanks, I agree that introducing mmap this way could help in having it
> merged earlier, independently of conversion support, to support testing.
>
> I'll adopt this patch in the next revision of 1G page support for
> guest_memfd.
>
> > ---
> >  include/linux/kvm_host.h | 11 +++++
> >  virt/kvm/Kconfig         |  4 ++
> >  virt/kvm/guest_memfd.c   | 93 ++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 8b5f28f6efff..438aa3df3175 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -728,6 +728,17 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> >  }
> >  #endif
> >
> > +/*
> > + * Arch code must define kvm_arch_gmem_supports_shared_mem if support for
> > + * private memory is enabled and it supports in-place shared/private conversion.
> > + */
> > +#if !defined(kvm_arch_gmem_supports_shared_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
> > +static inline bool kvm_arch_gmem_supports_shared_mem(struct kvm *kvm)
> > +{
> > +        return false;
> > +}
> > +#endif
>
> Perhaps this could be declared in the #ifdef CONFIG_KVM_PRIVATE_MEM
> block?
>
> Could this be defined as a __weak symbol for architectures to override?
> Or perhaps that can be done once guest_memfd gets refactored separately
> since now the entire guest_memfd.c isn't even compiled if
> CONFIG_KVM_PRIVATE_MEM is not set.

I don't have a strong opinion on this. I did it this way because it
matches the existing pattern for kvm_arch_has_private_mem(), and it
seemed logical to follow that.

> > +
> >  #ifndef kvm_arch_has_readonly_mem
> >  static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
> >  {
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 54e959e7d68f..4e759e8020c5 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -124,3 +124,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
> >  config HAVE_KVM_ARCH_GMEM_INVALIDATE
> >          bool
> >          depends on KVM_PRIVATE_MEM
> > +
> > +config KVM_GMEM_SHARED_MEM
> > +        select KVM_PRIVATE_MEM
> > +        bool
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index c6f6792bec2a..85467a3ef8ea 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -317,9 +317,102 @@ void kvm_gmem_handle_folio_put(struct folio *folio)
> >  {
> >          WARN_ONCE(1, "A placeholder that shouldn't trigger. Work in progress.");
> >  }
> > +
> > +static bool kvm_gmem_offset_is_shared(struct file *file, pgoff_t index)
> > +{
> > +        struct kvm_gmem *gmem = file->private_data;
> > +
> > +        /* For now, VMs that support shared memory share all their memory.
> > +         */
> > +        return kvm_arch_gmem_supports_shared_mem(gmem->kvm);
> > +}
> > +
> > +static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
> > +{
> > +        struct inode *inode = file_inode(vmf->vma->vm_file);
> > +        struct folio *folio;
> > +        vm_fault_t ret = VM_FAULT_LOCKED;
> > +
> > +        filemap_invalidate_lock_shared(inode->i_mapping);
> > +
> > +        folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> > +        if (IS_ERR(folio)) {
> > +                ret = VM_FAULT_SIGBUS;
>
> Will it always be a SIGBUS if there is some error getting a folio?

I could try to expand it and map more of the VM_FAULT_* codes to the
potential return values of __filemap_get_folio(), i.e.,

-EAGAIN -> VM_FAULT_RETRY
-ENOMEM -> VM_FAULT_OOM

but some don't really map 1:1 if you read the descriptions. For
example, is it correct to map

-ENOENT -> VM_FAULT_NOPAGE	/* fault installed the pte, not return page */

That said, it might be useful to map at least EAGAIN and ENOMEM.

> > +                goto out_filemap;
> > +        }
> > +
> > +        if (folio_test_hwpoison(folio)) {
> > +                ret = VM_FAULT_HWPOISON;
> > +                goto out_folio;
> > +        }
> > +
> > +        /* Must be called with folio lock held, i.e., after kvm_gmem_get_folio() */
> > +        if (!kvm_gmem_offset_is_shared(vmf->vma->vm_file, vmf->pgoff)) {
> > +                ret = VM_FAULT_SIGBUS;
> > +                goto out_folio;
> > +        }
> > +
> > +        /*
> > +         * Only private folios are marked as "guestmem" so far, and we never
> > +         * expect private folios at this point.
> > +         */
>
> Proposal - rephrase this comment as: before typed folios can be mapped,
> PGTY_guestmem is only tagged on folios so that guest_memfd will receive
> the kvm_gmem_handle_folio_put() callback. The tag is definitely not
> expected when a folio is about to be faulted in.
>
> I propose the above because I think technically when mappability is NONE
> the folio isn't private? Not sure if others see this differently.

A folio whose mappability is NONE is private as far as the host is
concerned. That said, I can rephrase this for clarity.
> > +        if (WARN_ON_ONCE(folio_test_guestmem(folio))) {
> > +                ret = VM_FAULT_SIGBUS;
> > +                goto out_folio;
> > +        }
> > +
> > +        /* No support for huge pages. */
> > +        if (WARN_ON_ONCE(folio_test_large(folio))) {
> > +                ret = VM_FAULT_SIGBUS;
> > +                goto out_folio;
> > +        }
> > +
> > +        if (!folio_test_uptodate(folio)) {
> > +                clear_highpage(folio_page(folio, 0));
> > +                kvm_gmem_mark_prepared(folio);
> > +        }
> > +
> > +        vmf->page = folio_file_page(folio, vmf->pgoff);
> > +
> > +out_folio:
> > +        if (ret != VM_FAULT_LOCKED) {
> > +                folio_unlock(folio);
> > +                folio_put(folio);
> > +        }
> > +
> > +out_filemap:
> > +        filemap_invalidate_unlock_shared(inode->i_mapping);
> > +
> > +        return ret;
> > +}
> > +
> > +static const struct vm_operations_struct kvm_gmem_vm_ops = {
> > +        .fault = kvm_gmem_fault,
> > +};
> > +
> > +static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +        struct kvm_gmem *gmem = file->private_data;
> > +
> > +        if (!kvm_arch_gmem_supports_shared_mem(gmem->kvm))
> > +                return -ENODEV;
> > +
> > +        if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
> > +            (VM_SHARED | VM_MAYSHARE)) {
> > +                return -EINVAL;
> > +        }
> > +
> > +        file_accessed(file);
> > +        vm_flags_set(vma, VM_DONTDUMP);
> > +        vma->vm_ops = &kvm_gmem_vm_ops;
> > +
> > +        return 0;
> > +}
> > +#else
> > +#define kvm_gmem_mmap NULL
> >  #endif /* CONFIG_KVM_GMEM_SHARED_MEM */
> >
> >  static struct file_operations kvm_gmem_fops = {
> > +        .mmap = kvm_gmem_mmap,
>
> I think it's better to surround this with #ifdef
> CONFIG_KVM_GMEM_SHARED_MEM so that when more code gets inserted between
> the struct declaration and the definition of kvm_gmem_mmap() it is more
> obvious that .mmap is only overridden when CONFIG_KVM_GMEM_SHARED_MEM is
> set.

This is a question of style, but I prefer it this way because it avoids
having more ifdefs. I think I've seen this pattern elsewhere in the
kernel. That said, if others disagree, I'm happy to change this.
Thank you,
/fuad

> >  	.open = generic_file_open,
> >  	.release = kvm_gmem_release,
> >  	.fallocate = kvm_gmem_fallocate,