Please ignore this email. It was sent as a mistake. > -----Original Message----- > From: Ang, Tien Sung <tien.sung.ang@xxxxxxxxx> > Sent: Friday, 14 July, 2023 11:33 AM > To: Ang, Tien Sung <tien.sung.ang@xxxxxxxxx> > Cc: Gavin Shan <gshan@xxxxxxxxxx>; stable@xxxxxxxxxxxxxxx; Shuai Hu > <hshuai@xxxxxxxxxx>; Zhenyu Zhang <zhenyzha@xxxxxxxxxx>; David > Hildenbrand <david@xxxxxxxxxx>; Oliver Upton <oliver.upton@xxxxxxxxx>; > Peter Xu <peterx@xxxxxxxxxx>; Christopherson,, Sean > <seanjc@xxxxxxxxxx>; Shaoqin Huang <shahuang@xxxxxxxxxx>; Paolo > Bonzini <pbonzini@xxxxxxxxxx> > Subject: [PATCH 1/2] KVM: Avoid illegal stage2 mapping on invalid memory > slot > > From: Gavin Shan <gshan@xxxxxxxxxx> > > We run into guest hang in edk2 firmware when KSM is kept as running on the > host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device > (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered > write. The status is returned by reading the memory region of the pflash > device and the read request should have been forwarded to QEMU and > emulated by it. Unfortunately, the read request is covered by an illegal > stage2 mapping when the guest hang issue occurs. The read request is > completed with QEMU bypassed and wrong status is fetched. The edk2 > firmware runs into an infinite loop with the wrong status. > > The illegal stage2 mapping is populated due to same page sharing by KSM at > (C) even the associated memory slot has been marked as invalid at (B) when > the memory slot is requested to be deleted. It's notable that the active and > inactive memory slots can't be swapped when we're in the middle of > kvm_mmu_notifier_change_pte() because kvm- > >mn_active_invalidate_count is elevated, and > kvm_swap_active_memslots() will busy loop until it reaches to zero again. > Besides, the swapping from the active to the inactive memory slots is also > avoided by holding &kvm->srcu in __kvm_handle_hva_range(), > corresponding to synchronize_srcu_expedited() in > kvm_swap_active_memslots(). > > CPU-A CPU-B > ----- ----- > ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION) > kvm_vm_ioctl_set_memory_region > kvm_set_memory_region > __kvm_set_memory_region > kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE) > kvm_invalidate_memslot > kvm_copy_memslot > kvm_replace_memslot > kvm_swap_active_memslots (A) > kvm_arch_flush_shadow_memslot (B) > same page sharing by KSM > kvm_mmu_notifier_invalidate_range_start > : > kvm_mmu_notifier_change_pte > kvm_handle_hva_range > __kvm_handle_hva_range > kvm_set_spte_gfn (C) > : > kvm_mmu_notifier_invalidate_range_end > > Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal > stage2 mapping so that the read request for the pflash's status is forwarded > to QEMU and emulated by it. In this way, the correct pflash's status can be > returned from QEMU to break the infinite loop in the edk2 firmware. > > We tried a git-bisect and the first problematic commit is cd4c71835228 (" > KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, > clean_dcache_guest_page() is called after the memory slots are iterated in > kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called > before the iteration on the memory slots before this commit. This change > literally enlarges the racy window between > kvm_mmu_notifier_change_pte() and memory slot removal so that we're > able to reproduce the issue in a practical test case. However, the issue exists > since commit d5d8184d35c9 > ("KVM: ARM: Memory virtualization setup"). > > Cc: stable@xxxxxxxxxxxxxxx # v3.9+ > Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") > Reported-by: Shuai Hu <hshuai@xxxxxxxxxx> > Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx> > Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx> > Reviewed-by: David Hildenbrand <david@xxxxxxxxxx> > Reviewed-by: Oliver Upton <oliver.upton@xxxxxxxxx> > Reviewed-by: Peter Xu <peterx@xxxxxxxxxx> > Reviewed-by: Sean Christopherson <seanjc@xxxxxxxxxx> > Reviewed-by: Shaoqin Huang <shahuang@xxxxxxxxxx> > Message-Id: <20230615054259.14911-1-gshan@xxxxxxxxxx> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > --- > virt/kvm/kvm_main.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > 479802a892d4..65f94f592ff8 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -686,6 +686,24 @@ static __always_inline int > kvm_handle_hva_range_no_flush(struct mmu_notifier *mn > > return __kvm_handle_hva_range(kvm, &range); } > + > +static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range > +*range) { > + /* > + * Skipping invalid memslots is correct if and only change_pte() is > + * surrounded by invalidate_range_{start,end}(), which is currently > + * guaranteed by the primary MMU. If that ever changes, KVM > needs to > + * unmap the memslot instead of skipping the memslot to ensure > that KVM > + * doesn't hold references to the old PFN. > + */ > + WARN_ON_ONCE(!READ_ONCE(kvm- > >mn_active_invalidate_count)); > + > + if (range->slot->flags & KVM_MEMSLOT_INVALID) > + return false; > + > + return kvm_set_spte_gfn(kvm, range); > +} > + > static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, > struct mm_struct *mm, > unsigned long address, > @@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct > mmu_notifier *mn, > if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) > return; > > - kvm_handle_hva_range(mn, address, address + 1, pte, > kvm_set_spte_gfn); > + kvm_handle_hva_range(mn, address, address + 1, pte, > +kvm_change_spte_gfn); > } > > void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, > -- > 2.25.1