On Mon, 2021-11-15 at 23:22 +0000, David Woodhouse wrote:
> On Mon, 2021-11-15 at 22:59 +0000, Sean Christopherson wrote:
> > On Mon, Nov 15, 2021, Paolo Bonzini wrote:
> > > On 11/15/21 20:11, David Woodhouse wrote:
> > > > > Changing mn_memslots_update_rcuwait to a waitq (and renaming it to
> > > > > mn_invalidate_waitq) is of course also a possibility.
> > > > 
> > > > I suspect that's the answer.
> > > > 
> > > > I think the actual *invalidation* of the cache still lives in the
> > > > invalidate_range() callback where I have it at the moment.
> > 
> > Oooh! [finally had a lightbulb moment about ->invalidate_range() after
> > years of befuddlement].
> > 
> > Two things:
> > 
> >  1. Using _only_ ->invalidate_range() is not correct. ->invalidate_range()
> >     is required if and only if the old PFN needs to be _unmapped_.
> >     Specifically, if the protections are being downgraded without changing
> >     the PFN, it doesn't need to be called. E.g. from
> >     hugetlb_change_protection():
> 
> OK, that's kind of important to realise. Thanks.
> 
> So, I had just split the atomic and guest-mode invalidations apart:
> https://git.infradead.org/users/dwmw2/linux.git/commitdiff/6cf5fe318fd
> but will go back to doing it all in invalidate_range_start from a
> single list.
> 
> And just deal with the fact that the atomic users now have to
> loop/retry/wait for there *not* to be an MMU notification in progress.
> 
> > I believe we could use ->invalidate_range() to handle the unmap case if
> > KVM's ->invalidate_range_start() hook is enhanced to handle the RW=>R
> > case. The "struct mmu_notifier_range" provides the event type, IIUC we
> > could have the _start() variant handle MMU_NOTIFY_PROTECTION_{VMA,PAGE}
> > (and maybe MMU_NOTIFY_SOFT_DIRTY?), and let the more precise unmap-only
> > variant handle everything else.
> 
> Not sure that helps us much. It was the termination condition on the
> "when should we keep retrying, and when should we give up?" that was
> painful, and a mixed mode doesn't make that problem go away.
> 
> I'll go back and have another look in the morning, with something much
> closer to what I showed in
> https://lore.kernel.org/kvm/040d61dad066eb2517c108232efb975bc1cda780.camel@xxxxxxxxxxxxx/

Looks a bit like this, and it seems to be working for the Xen event
channel self-test. I'll port it into our actual Xen hosting environment
and give it some more serious testing.

I'm not sure I'm ready to sign up to immediately fix everything that's
hosed in nesting and kill off all users of the unsafe kvm_vcpu_map(),
but I'll at least convert one vCPU user to demonstrate that the new
gfn_to_pfn_cache is working sanely for that use case.

From: David Woodhouse <dwmw@xxxxxxxxxxxx>
Subject: [PATCH 08/10] KVM: Reinstate gfn_to_pfn_cache with invalidation support

This can be used in two modes. There is an atomic mode where the cached
mapping is accessed while holding the rwlock, and a mode where the
physical address is used by a vCPU in guest mode.

For the latter case, an invalidation will wake the vCPU with the new
KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any
caches it still needs to access before entering guest mode again.

Only one vCPU can be targeted by the wake requests; it's simple enough
to make it wake all vCPUs or even a mask, but I don't see a use case
for that additional complexity right now.

Invalidation happens from the invalidate_range_start MMU notifier,
which needs to be able to sleep in order to wake the vCPU and wait for
it.
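To illustrate the intended guest-mode flow (this is only a sketch, not
code from this series; the function name and wherever gpc/gpa/len live
are invented, the API calls are the ones added below):

/*
 * Illustrative sketch only: a vCPU user revalidates its cache before
 * each guest entry, outside of guest mode where sleeping is allowed.
 */
static int example_refresh_gpc_before_entry(struct kvm_vcpu *vcpu,
					    struct gfn_to_pfn_cache *gpc,
					    gpa_t gpa, unsigned long len)
{
	/* Consume a pending invalidation and/or revalidate the cache */
	if (kvm_check_request(KVM_REQ_GPC_INVALIDATE, vcpu) ||
	    !kvm_gfn_to_pfn_cache_check(vcpu->kvm, gpc, gpa, len)) {
		/* May sleep and retry internally; must not run in guest mode */
		if (kvm_gfn_to_pfn_cache_refresh(vcpu->kvm, gpc, gpa, len, true))
			return -EFAULT;
	}

	/*
	 * gpc->pfn (and gpc->khva, if kernel_map was set) can now be used
	 * until the next KVM_REQ_GPC_INVALIDATE arrives. An atomic-context
	 * user instead takes read_lock(&gpc->lock) around each access and
	 * punts to a refresh if kvm_gfn_to_pfn_cache_check() fails.
	 */
	return 0;
}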
This means that revalidation potentially needs to "wait" for the MMU
operation to complete and the invalidate_range_end notifier to be
invoked. Like a vCPU taking a page fault in that period, we just spin
for now; fixing that by implementing an actual *wait* may be another
part of shaving this particularly hirsute yak, in a future patch.

As noted in the comments in the function itself, the only case where
the invalidate_range_start notifier is expected to be called *without*
being able to sleep is when the OOM reaper is killing the process. In
that case, we expect the vCPU threads already to have exited, and thus
there will be nothing to wake, and no reason to wait. So we clear the
KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly
if there actually *was* anything to wake up.

Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
 arch/x86/kvm/Kconfig              |   1 +
 include/linux/kvm_host.h          |  14 ++
 include/linux/kvm_types.h         |  17 ++
 virt/kvm/Kconfig                  |   3 +
 virt/kvm/Makefile.kvm             |   1 +
 virt/kvm/dirty_ring.c             |   2 +-
 virt/kvm/kvm_main.c               |  13 +-
 virt/kvm/{mmu_lock.h => kvm_mm.h} |  23 ++-
 virt/kvm/pfncache.c               | 275 ++++++++++++++++++++++++++++++
 9 files changed, 342 insertions(+), 7 deletions(-)
 rename virt/kvm/{mmu_lock.h => kvm_mm.h} (55%)
 create mode 100644 virt/kvm/pfncache.c

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index d7fa0a42ac25..af351107d47f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -26,6 +26,7 @@ config KVM
 	select PREEMPT_NOTIFIERS
 	select MMU_NOTIFIER
 	select HAVE_KVM_IRQCHIP
+	select HAVE_KVM_PFNCACHE
 	select HAVE_KVM_IRQFD
 	select HAVE_KVM_DIRTY_RING
 	select IRQ_BYPASS_MANAGER
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c310648cc8f1..52e17e4b7694 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,6 +151,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
 #define KVM_REQ_VM_DEAD           (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_GPC_INVALIDATE    (5 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -559,6 +560,10 @@ struct kvm {
 	unsigned long mn_active_invalidate_count;
 	struct rcuwait mn_memslots_update_rcuwait;
 
+	/* For management / invalidation of gfn_to_pfn_caches */
+	spinlock_t gpc_lock;
+	struct list_head gpc_list;
+
 	/*
 	 * created_vcpus is protected by kvm->lock, and is incremented
 	 * at the beginning of KVM_CREATE_VCPU.  online_vcpus is only
@@ -966,6 +971,15 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data,
 			 unsigned long len);
 void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
 
+int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+			      struct kvm_vcpu *vcpu, bool kernel_map,
+			      gpa_t gpa, unsigned long len, bool write);
+int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				 gpa_t gpa, unsigned long len, bool write);
+bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				gpa_t gpa, unsigned long len);
+void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc);
+
 void kvm_sigset_activate(struct kvm_vcpu *vcpu);
 void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
 
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 234eab059839..e454d2c003d6 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -19,6 +19,7 @@ struct kvm_memslots;
 enum kvm_mr_change;
 
 #include <linux/types.h>
+#include <linux/spinlock_types.h>
 
 #include <asm/kvm_types.h>
 
@@ -53,6 +54,22 @@ struct gfn_to_hva_cache {
 	struct kvm_memory_slot *memslot;
 };
 
+struct gfn_to_pfn_cache {
+	u64 generation;
+	gpa_t gpa;
+	unsigned long uhva;
+	struct kvm_memory_slot *memslot;
+	struct kvm_vcpu *vcpu;
+	struct list_head list;
+	rwlock_t lock;
+	void *khva;
+	kvm_pfn_t pfn;
+	bool active;
+	bool valid;
+	bool dirty;
+	bool kernel_map;
+};
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 /*
  * Memory caches are used to preallocate memory ahead of various MMU flows,
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 97cf5413ac25..f4834c20e4a6 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -4,6 +4,9 @@
 config HAVE_KVM
 	bool
 
+config HAVE_KVM_PFNCACHE
+	bool
+
 config HAVE_KVM_IRQCHIP
 	bool
 
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index ee9c310f3601..ca499a216d0f 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -11,3 +11,4 @@ kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQCHIP) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
+kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 88f4683198ea..2b4474387895 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -9,7 +9,7 @@
 #include <linux/vmalloc.h>
 #include <linux/kvm_dirty_ring.h>
 #include <trace/events/kvm.h>
-#include "mmu_lock.h"
+#include "kvm_mm.h"
 
 int __weak kvm_cpu_dirty_log_size(void)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 356d636e037d..85506e4bd145 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -59,7 +59,7 @@
 
 #include "coalesced_mmio.h"
 #include "async_pf.h"
-#include "mmu_lock.h"
+#include "kvm_mm.h"
 #include "vfio.h"
 
 #define CREATE_TRACE_POINTS
@@ -684,6 +684,9 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
+					  hva_range.may_block);
+
 	__kvm_handle_hva_range(kvm, &hva_range);
 
 	return 0;
@@ -1051,6 +1054,9 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 
+	INIT_LIST_HEAD(&kvm->gpc_list);
+	spin_lock_init(&kvm->gpc_lock);
+
 	INIT_LIST_HEAD(&kvm->devices);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
@@ -2390,8 +2396,8 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			    bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+		     bool write_fault, bool *writable)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
diff --git a/virt/kvm/mmu_lock.h b/virt/kvm/kvm_mm.h
similarity index 55%
rename from virt/kvm/mmu_lock.h
rename to virt/kvm/kvm_mm.h
index 9e1308f9734c..b976e4b07e88 100644
--- a/virt/kvm/mmu_lock.h
+++ b/virt/kvm/kvm_mm.h
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
-#ifndef KVM_MMU_LOCK_H
-#define KVM_MMU_LOCK_H 1
+#ifndef __KVM_MM_H__
+#define __KVM_MM_H__ 1
 
 /*
  * Architectures can choose whether to use an rwlock or spinlock
@@ -20,4 +20,21 @@
 #define KVM_MMU_UNLOCK(kvm)    spin_unlock(&(kvm)->mmu_lock)
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
-#endif
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
+		     bool write_fault, bool *writable);
+
+#ifdef CONFIG_HAVE_KVM_PFNCACHE
+void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
+				       unsigned long start,
+				       unsigned long end,
+				       bool may_block);
+#else
+static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
+						     unsigned long start,
+						     unsigned long end,
+						     bool may_block)
+{
+}
+#endif /* HAVE_KVM_PFNCACHE */
+
+#endif /* __KVM_MM_H__ */
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
new file mode 100644
index 000000000000..f2efc52039a8
--- /dev/null
+++ b/virt/kvm/pfncache.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables kernel and guest-mode vCPU access to guest physical
+ * memory with suitable invalidation mechanisms.
+ *
+ * Copyright © 2021 Amazon.com, Inc. or its affiliates.
+ *
+ * Authors:
+ *   David Woodhouse <dwmw2@xxxxxxxxxxxxx>
+ */
+
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+
+#include "kvm_mm.h"
+
+/*
+ * MMU notifier 'invalidate_range_start' hook.
+ */
+void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
+				       unsigned long end, bool may_block)
+{
+	DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
+	struct gfn_to_pfn_cache *gpc;
+	bool wake_vcpus = false;
+
+	spin_lock(&kvm->gpc_lock);
+	list_for_each_entry(gpc, &kvm->gpc_list, list) {
+		write_lock_irq(&gpc->lock);
+
+		/* Only a single page so no need to care about length */
+		if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) &&
+		    gpc->uhva >= start && gpc->uhva < end) {
+			gpc->valid = false;
+
+			if (gpc->dirty) {
+				int idx = srcu_read_lock(&kvm->srcu);
+				mark_page_dirty(kvm, gpa_to_gfn(gpc->gpa));
+				srcu_read_unlock(&kvm->srcu, idx);
+
+				kvm_set_pfn_dirty(gpc->pfn);
+				gpc->dirty = false;
+			}
+
+			/*
+			 * If a guest vCPU could be using the physical address,
+			 * it needs to be woken.
+			 */
+			if (gpc->vcpu) {
+				if (!wake_vcpus) {
+					wake_vcpus = true;
+					bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
+				}
+				__set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap);
+			}
+		}
+		write_unlock_irq(&gpc->lock);
+	}
+	spin_unlock(&kvm->gpc_lock);
+
+	if (wake_vcpus) {
+		unsigned int req = KVM_REQ_GPC_INVALIDATE;
+		bool called;
+
+		/*
+		 * If the OOM reaper is active, then all vCPUs should have
+		 * been stopped already, so perform the request without
+		 * KVM_REQUEST_WAIT and be sad if any needed to be woken.
+		 */
+		if (!may_block)
+			req &= ~KVM_REQUEST_WAIT;
+
+		called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap);
+
+		WARN_ON_ONCE(called && !may_block);
+	}
+}
+
+bool kvm_gfn_to_pfn_cache_check(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				gpa_t gpa, unsigned long len)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+
+	if ((gpa & ~PAGE_MASK) + len > PAGE_SIZE)
+		return false;
+
+	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
+	    kvm_is_error_hva(gpc->uhva))
+		return false;
+
+	if (!gpc->valid)
+		return false;
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_check);
+
+int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+				 gpa_t gpa, unsigned long len, bool write)
+{
+	struct kvm_memslots *slots = kvm_memslots(kvm);
+	unsigned long page_offset = gpa & ~PAGE_MASK;
+	kvm_pfn_t old_pfn, new_pfn;
+	unsigned long old_uhva;
+	gpa_t old_gpa;
+	void *old_khva;
+	bool old_valid, old_dirty;
+	int ret = 0;
+
+	/*
+	 * It must fit within a single page. The 'len' argument is
+	 * only to enforce that.
+	 */
+	if (page_offset + len > PAGE_SIZE)
+		return -EINVAL;
+
+	write_lock_irq(&gpc->lock);
+
+	old_gpa = gpc->gpa;
+	old_pfn = gpc->pfn;
+	old_khva = gpc->khva;
+	old_uhva = gpc->uhva;
+	old_valid = gpc->valid;
+	old_dirty = gpc->dirty;
+
+	/* If the userspace HVA is invalid, refresh that first */
+	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
+	    kvm_is_error_hva(gpc->uhva)) {
+		gfn_t gfn = gpa_to_gfn(gpa);
+
+		gpc->dirty = false;
+		gpc->gpa = gpa;
+		gpc->generation = slots->generation;
+		gpc->memslot = __gfn_to_memslot(slots, gfn);
+		gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
+
+		if (kvm_is_error_hva(gpc->uhva)) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		gpc->uhva += page_offset;
+	}
+
+	/*
+	 * If the userspace HVA changed or the PFN was already invalid,
+	 * drop the lock and do the HVA to PFN lookup again.
+	 */
+	if (!old_valid || old_uhva != gpc->uhva) {
+		unsigned long uhva = gpc->uhva;
+		void *new_khva = NULL;
+		unsigned long mmu_seq;
+		int retry;
+
+		/* Placeholders for "hva is valid but not yet mapped" */
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->khva = NULL;
+		gpc->valid = true;
+
+		write_unlock_irq(&gpc->lock);
+
+	retry_map:
+		mmu_seq = kvm->mmu_notifier_seq;
+		smp_rmb();
+
+		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		if (is_error_noslot_pfn(new_pfn)) {
+			ret = -EFAULT;
+			goto map_done;
+		}
+
+		read_lock(&kvm->mmu_lock);
+		retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
+		read_unlock(&kvm->mmu_lock);
+		if (retry) {
+			cond_resched();
+			goto retry_map;
+		}
+
+		if (gpc->kernel_map) {
+			if (new_pfn == old_pfn) {
+				new_khva = (void *)((unsigned long)old_khva - page_offset);
+				old_pfn = KVM_PFN_ERR_FAULT;
+				old_khva = NULL;
+			} else if (pfn_valid(new_pfn)) {
+				new_khva = kmap(pfn_to_page(new_pfn));
+#ifdef CONFIG_HAS_IOMEM
+			} else {
+				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
+#endif
+			}
+			if (!new_khva)
+				ret = -EFAULT;
+		}
+
+	map_done:
+		write_lock_irq(&gpc->lock);
+		if (ret) {
+			gpc->valid = false;
+			gpc->pfn = KVM_PFN_ERR_FAULT;
+			gpc->khva = NULL;
+		} else {
+			/* At this point, gpc->valid may already have been cleared */
+			gpc->pfn = new_pfn;
+			gpc->khva = new_khva + page_offset;
+		}
+	}
+
+ out:
+	if (ret)
+		gpc->dirty = false;
+	else
+		gpc->dirty = write;
+
+	write_unlock_irq(&gpc->lock);
+
+	/* Unmap the old page if it was mapped before */
+	if (!is_error_noslot_pfn(old_pfn)) {
+		if (pfn_valid(old_pfn)) {
+			kunmap(pfn_to_page(old_pfn));
+#ifdef CONFIG_HAS_IOMEM
+		} else {
+			memunmap(old_khva);
+#endif
+		}
+		kvm_release_pfn(old_pfn, old_dirty);
+		if (old_dirty)
+			mark_page_dirty(kvm, old_gpa);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_refresh);
+
+int kvm_gfn_to_pfn_cache_init(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
+			      struct kvm_vcpu *vcpu, bool kernel_map,
+			      gpa_t gpa, unsigned long len, bool write)
+{
+	if (!gpc->active) {
+		rwlock_init(&gpc->lock);
+
+		gpc->khva = NULL;
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->uhva = KVM_HVA_ERR_BAD;
+		gpc->vcpu = vcpu;
+		gpc->kernel_map = kernel_map;
+		gpc->valid = false;
+		gpc->active = true;
+
+		spin_lock(&kvm->gpc_lock);
+		list_add(&gpc->list, &kvm->gpc_list);
+		spin_unlock(&kvm->gpc_lock);
+	}
+	return kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpa, len, write);
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_init);
+
+void kvm_gfn_to_pfn_cache_destroy(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
+{
+	if (gpc->active) {
+		spin_lock(&kvm->gpc_lock);
+		list_del(&gpc->list);
+		spin_unlock(&kvm->gpc_lock);
+
+		/* Refreshing GPA_INVALID will fail; in failing, it tears down any existing mapping */
+		(void)kvm_gfn_to_pfn_cache_refresh(kvm, gpc, GPA_INVALID, 0, false);
+		gpc->active = false;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_gfn_to_pfn_cache_destroy);
-- 
2.31.1
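P.S. For completeness, the caller lifecycle I have in mind looks roughly
like this (illustrative only; where the cache is embedded and the
gpa/len/new_gpa variables come from is up to the user, not part of this
patch):

	struct gfn_to_pfn_cache gpc;	/* typically embedded in an arch struct */

	/* Set up (and initially map) the cache; may sleep */
	ret = kvm_gfn_to_pfn_cache_init(kvm, &gpc, vcpu, true /* kernel_map */,
					gpa, len, true /* write */);

	/* If the guest later points us at a different address, just refresh */
	ret = kvm_gfn_to_pfn_cache_refresh(kvm, &gpc, new_gpa, len, true);

	/* On teardown, unmap and unlink from kvm->gpc_list */
	kvm_gfn_to_pfn_cache_destroy(kvm, &gpc);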