On Wed, 23 Aug 2023, Sean Christopherson wrote:
> On Wed, Aug 23, 2023, Eric Wheeler wrote:

...

> > 22:23:31:481714 tid[142943] pid[142917] MAP @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> > 22:23:31:481714 tid[142943] pid[142917] ITER @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000 (cf7f000), hva = 7f6d24f7f000, pfn = 1ee50ee, tdp_mmu = 1, role = 3784, count = 1
> > 22:23:31:481715 tid[142943] pid[142917] SPTE @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee
> > 22:23:31:481716 tid[142943] pid[142917] MAP_RET @ rip ffffffffa43ce877 (1208 hits), gpa = cf7f000, hva = 7f6d24f7f000, pfn = 1ee50ee, ret = 4
>
> This vCPU is making forward progress, it just happens to be taking lots of
> faults on a single RIP.  MAP_RET's "ret = 4" means KVM did indeed "fix" the
> fault, and the faulting GPA/HVA is changing on every fault.  Best guess is
> that the guest is zeroing a hugepage, but the guest's 2MiB page is mapped
> with 4KiB EPT entries, i.e. the vCPU is doing REP STOS on a 2MiB region.

...

> > > 21:25:50:282726 tid[3484173] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (92239 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

...

> > > 21:25:50:282354 tid[3484174] pid[3484149] FAULTIN @ rip ffffffff814e6ca5 (90073 hits), gpa = 1343fa0b0, hva = 7feb409fa000, flags = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000
> > > 21:25:50:282571 tid[3484174] pid[3484149] FAULTIN_RET @ rip ffffffff814e6ca5 (90121 hits), gpa = 1343fa0b0, hva = 7feb409fa000 (7feb409fa000), flags = 0, pfn = 18d1d46, ret = 0 : MMU seq = 8002dc25, in-prog = 0, start = 7feacde61000, end = 7feacde62000

...

> These vCPUs (which belong to the same VM) appear to be well and truly stuck.
> The fact that you got prints from MAP, MAP_RET, ITER, and SPTE for an
> unrelated (and not stuck) vCPU is serendipitous, as it all but guarantees
> that the trace is "good", i.e. that MAP prints aren't missing because the
> bpf program is bad.
>
> Since the mmu_notifier info is stable (though the seq is still *insanely*
> high), assuming there's no kernel memory corruption, that means that KVM is
> bailing because is_page_fault_stale() returns true.  Based on v5.15 being
> the last known good kernel for you,

I should note that v5.15 being the last known good kernel was speculative
and may not be 100% correct.  However, the occurrence of failures on <= 5.15
is minuscule compared to > 5.15, as almost all occurrences are on 6.1.x
(which could be biased by the number of 6.1.x deployments).

It is entirely possible that the <= 5.15 reports are false positives, but we
don't have TBF on anything < 6.1.30, so no kprobe traces there.

> that places the blame squarely on commit a955cad84cda ("KVM: x86/mmu: Retry
> page fault if root is invalidated by memslot update").  Note, that commit
> had an unrelated bug fixed by 18c841e1f411 ("KVM: x86: Retry page fault if
> MMU reload is pending and root has no sp"), but all flavors of v6.1 have
> said fix, and that bug caused a crash, not stuck vCPUs.
>
> The "MMU seq = 8002dc25" value still gives me pause.  It's one hell of a
> coincidence that all stuck vCPUs have had a sequence counter of 0x800xxxxx.
>
> <time goes by as I keep staring>
>
> Fudge (that's not actually what I said).  *sigh*
>
> Not a coincidence, at all.  The bug is that, in v6.1, is_page_fault_stale()
> takes the local @mmu_seq snapshot as an int, whereas the per-VM count is
> stored as an unsigned long.

I'm surprised that there were no compiler warnings about the signedness or
type-precision mismatch.  What would have prevented such a compiler warning?
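(My guess: the unsigned long to int assignment is an ordinary implicit
conversion, which GCC only diagnoses under -Wconversion, and the kernel is
not built with -Wconversion because it would be far too noisy.)  Here is a
minimal userspace sketch of the failure mode as I understand it, with
made-up names rather than the real KVM code:

	/* Sketch only: stand-in for kvm->mmu_invalidate_seq with bit 31 set. */
	static unsigned long invalidate_seq = 0x8002dc25UL;

	/* Stand-in for mmu_invalidate_retry_hva(), which takes unsigned long. */
	static int retry(unsigned long mmu_seq)
	{
		/* 0x8002dc25 != 0xffffffff8002dc25, so always "stale". */
		return invalidate_seq != mmu_seq;
	}

	int main(void)
	{
		/*
		 * Stand-in for the v6.1 is_page_fault_stale() local: the
		 * unsigned long -> int conversion is implicit and silent
		 * unless -Wconversion is enabled, and once bit 31 is set
		 * the stored value is negative.
		 */
		int mmu_seq = invalidate_seq;

		/* The negative int sign-extends back to 0xffffffff8002dc25. */
		return retry(mmu_seq);	/* returns 1: permanently "stale" */
	}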
> When the sequence sets bit 31, the local @mmu_seq value becomes a signed
> *negative* value, and then when that gets passed to
> mmu_invalidate_retry_hva(), which correctly takes an unsigned long, the
> negative value gets sign-extended and so the comparison ends up being
>
>	if (0x8002dc25 != 0xffffffff8002dc25)
>
> and KVM thinks the sequence count is stale.  I missed it for so long
> because I was stupidly looking mostly at upstream code (see below), and
> because of the subtle sign-extension behavior (I was mostly on the lookout
> for a straight truncation bug where bits[63:32] got dropped).
>
> I suspect others haven't hit this issue because no one else is generating
> anywhere near the same number of mmu_notifier invalidations, and/or live
> migrates VMs more regularly (which effectively resets the sequence count).
>
> The real kicker to all this is that the bug was accidentally fixed in v6.3
> by commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in
> kvm_faultin_pfn()"), as that refactoring correctly stored the "local"
> mmu_seq as an unsigned long.
>
> I'll post the below as a proper patch for inclusion in stable kernels.

Awesome, and well done.

Can you think of a "simple" patch for the 6.1-series that would be
live-patch safe?

-Eric

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 230108a90cf3..beca03556379 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   * root was invalidated by a memslot update or a relevant mmu_notifier fired.
>   */
>  static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> -				struct kvm_page_fault *fault, int mmu_seq)
> +				struct kvm_page_fault *fault,
> +				unsigned long mmu_seq)
>  {
>  	struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
>
> P.S. FWIW, it's probably worth taking a peek at your NUMA setup and/or KSM
> settings.  2 billion invalidations is still quite insane, even for a
> long-lived VM.  E.g. we (Google) disable NUMA balancing and instead rely on
> other parts of the stack to hit our SLOs for NUMA locality.  That certainly
> has its own challenges, and might not be viable for your environment, but
> NUMA balancing is far from a free lunch for VMs; NUMA balancing is a _lot_
> more costly when KVM is on the receiving end due to the way mmu_notifier
> invalidations work.  And for KSM, I personally think KSM is a terrible
> tradeoff and should never be enabled.  It saves memory at the cost of CPU
> cycles, guest performance, and guest security.

--
Eric Wheeler