On Tue, 22 Aug 2023, Sean Christopherson wrote: > On Mon, Aug 21, 2023, Eric Wheeler wrote: > > On Mon, 21 Aug 2023, Sean Christopherson wrote: > > > On Mon, Aug 21, 2023, Eric Wheeler wrote: > > > > On Fri, 18 Aug 2023, Eric Wheeler wrote: > > > > > On Fri, 18 Aug 2023, Sean Christopherson wrote: > > > > > > On Thu, Aug 17, 2023, Eric Wheeler wrote: > > > > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote: > > > > To me, these are opaque numbers. What do they represent? What are you looking for in them? > > > > > > inprog is '1' if there is an in-progress mmu_notifier invalidation at the time > > > of the EPT violation. start/end are the range that is being invalidated _if_ > > > there is an in-progress invalidation. If a vCPU were stuck with inprog=1, then > > > the most likely scenario is that there's an unpaired invalidation, i.e. something > > > started an invalidation but never completed it. > > > > > > seq is a sequence count that is incremented when an invalidation completes, e.g. > > > if a vCPU was stuck and seq were constantly changing, then it would mean that > > > the primary MMU is invalidating the same range over and over so quickly that the > > > vCPU can't make forward progress. > > > > Here is another one, I think you described exactly this: the vcpu is > > always the same, and the sequence increments, forever: > > > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000 > > 2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000 > > 1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000 > > > > Here is the whole log with 500,000+ lines over 5 minutes of recording, it > > was first stuck on one vcpu for most of the time, and toward the end it > > was stuck on a different VCPU: > > > > The file starts with 555,596 occurances of vcpu=ffff9964cdc48000 and is > > then followed by 31,784 occurances of vcpu=ffff9934ed50c680. As you can > > see in the file, they are not interleaved: > > > > https://www.linuxglobal.com/out/handle_ept_violation.log2 > > > > # awk '{print $3}' handle_ept_violation.log2 |uniq -c > > 555596 vcpu=ffff9964cdc48000 > > 31784 vcpu=ffff9934ed50c680 > > Hrm, but the address range being invalidated is changing. Without seeing the > guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and > truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM > that it looks stuck. > > Drat. > > > > Below is another bpftrace program that will hopefully shrink the > > > haystack to the point where we can find something via code inspection. > > > > Ok thanks, we'll give it a try. > > Try this version instead. It's more comprehensive and more precise, e.g. should > only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof. > > Note! Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP > from the VMCS. Without that enabled, RIP from vcpu->arch.regs[16] may be stale. Thanks, we'll try it out. To confirm, when you say "Enable trace_kvm_exit", is it this: echo 1 > /sys/kernel/tracing/events/kvm/kvm_exit/enable or this (which might be the same): echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable or something else? -- Eric Wheeler > > struct kvm_page_fault { > const u64 addr; > const u32 error_code; > const bool prefetch; > > const bool exec; > const bool write; > const bool present; > const bool rsvd; > const bool user; > > const bool is_tdp; > const bool nx_huge_page_workaround_enabled; > > bool huge_page_disallowed; > u8 max_level; > > u8 req_level; > > u8 goal_level; > > u64 gfn; > > struct kvm_memory_slot *slot; > > u64 pfn; > unsigned long hva; > bool map_writable; > }; > > kprobe:kvm_faultin_pfn > { > $vcpu = (struct kvm_vcpu *)arg0; > $kvm = $vcpu->kvm; > $rip = $vcpu->arch.regs[16]; > > if (@last_rip[tid] == $rip) { > @same[tid]++ > } else { > @same[tid] = 0; > } > @last_rip[tid] = $rip; > > if (@same[tid] > 1000) { > $fault = (struct kvm_page_fault *)arg1; > $hva = -1; > $flags = 0; > > if ($fault->slot != 0) { > $hva = $fault->slot->userspace_addr + > (($fault->gfn - $fault->slot->base_gfn) << 12); > $flags = $fault->slot->flags; > } > > printf("%s tid[%u] pid[%u] stuck @ rip %lx (%lu hits), gpa = %lx, hva = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n", > strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva, > $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress, > $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end); > } > } > > kprobe:make_mmio_spte > { > if (@same[tid] > 1000) { > $vcpu = (struct kvm_vcpu *)arg0; > $rip = $vcpu->arch.regs[16]; > > printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_mmio_spte()\n", > strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip); > } > } > > kprobe:make_spte > { > if (@same[tid] > 1000) { > $vcpu = (struct kvm_vcpu *)arg0; > $rip = $vcpu->arch.regs[16]; > > printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_spte()\n", > strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip); > } > } >