On Mon, 21 Aug 2023, Sean Christopherson wrote: > On Mon, Aug 21, 2023, Eric Wheeler wrote: > > On Fri, 18 Aug 2023, Eric Wheeler wrote: > > > On Fri, 18 Aug 2023, Sean Christopherson wrote: > > > > On Thu, Aug 17, 2023, Eric Wheeler wrote: > > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote: > > > > > > > > kprobe:handle_ept_violation > > > > > > > > { > > > > > > > > printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n", > > > > > > > > arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr, > > > > > > > > ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq, > > > > > > > > ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress, > > > > > > > > ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start, > > > > > > > > ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end); > > > > > > > > } > > > > > > > > > > > > > > > > If you don't have BTF info, we can still use a bpf program, but to get at the > > > > > > > > fields of interested, I think we'd have to resort to pointer arithmetic with struct > > > > > > > > offsets grab from your build. > > > > > > > > > > > > > > We have BTF, so hurray for not needing struct offsets! > > > > We found a new sample in 6.1.38, right after a lockup, where _all_ log > > entries show inprog=1, in case that is interesting. Here is a sample, > > there are 500,000+ entries so let me know if you want the whole log. > > > > To me, these are opaque numbers. What do they represent? What are you looking for in them? > > inprog is '1' if there is an in-progress mmu_notifier invalidation at the time > of the EPT violation. start/end are the range that is being invalidated _if_ > there is an in-progress invalidation. If a vCPU were stuck with inprog=1, then > the most likely scenario is that there's an unpaired invalidation, i.e. something > started an invalidation but never completed it. > > seq is a sequence count that is incremented when an invalidation completes, e.g. > if a vCPU was stuck and seq were constantly changing, then it would mean that > the primary MMU is invalidating the same range over and over so quickly that the > vCPU can't make forward progress. Here is another one, I think you described exactly this: the vcpu is always the same, and the sequence increments, forever: 1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000 2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000 1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000 1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000 Here is the whole log with 500,000+ lines over 5 minutes of recording, it was first stuck on one vcpu for most of the time, and toward the end it was stuck on a different VCPU: The file starts with 555,596 occurances of vcpu=ffff9964cdc48000 and is then followed by 31,784 occurances of vcpu=ffff9934ed50c680. As you can see in the file, they are not interleaved: https://www.linuxglobal.com/out/handle_ept_violation.log2 # awk '{print $3}' handle_ept_violation.log2 |uniq -c 555596 vcpu=ffff9964cdc48000 31784 vcpu=ffff9934ed50c680 > Below is another bpftrace program that will hopefully shrink the > haystack to the point where we can find something via code inspection. Ok thanks, we'll give it a try. > Just a heads up, if we don't find a smoking gun, I likely won't be able > to help beyond high level guidance as we're approaching what I can do > with bpftrace without a significant time investment (which I can't > make). I understand, I'm grateful for your help, whatever you can offer. -Eric > > --- > struct kvm_page_fault { > const gpa_t addr; > const u32 error_code; > const bool prefetch; > > const bool exec; > const bool write; > const bool present; > const bool rsvd; > const bool user; > > const bool is_tdp; > const bool nx_huge_page_workaround_enabled; > > bool huge_page_disallowed; > u8 max_level; > > u8 req_level; > > u8 goal_level; > > gfn_t gfn; > > struct kvm_memory_slot *slot; > > kvm_pfn_t pfn; > unsigned long hva; > bool map_writable; > }; > > kprobe:handle_ept_violation > { > $vcpu = (struct kvm_vcpu *)arg0; > $nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious; > > if (($vcpu->stat.pf_taken / 5) > $nr_handled) { > printf("PID = %u stuck, taken = %lu, emulated = %lu, fixed = %lu, spurious = %lu\n", > pid, $vcpu->stat.pf_taken, $vcpu->stat.pf_emulate, $vcpu->stat.pf_fixed, $vcpu->stat.pf_spurious); > } > } > > kprobe:kvm_faultin_pfn > { > $vcpu = (struct kvm_vcpu *)arg0; > $nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious; > > if (($vcpu->stat.pf_taken / 5) > $nr_handled) { > $fault = (struct kvm_page_fault *)arg1; > $flags = 0; > > if ($fault->slot != 0) { > $flags = $fault->slot->flags; > } > > printf("PID = %u stuck, reached kvm_faultin_pfn(), slot = %lx, flags = %lx\n", > pid, arg1, $flags); > } > } > > kprobe:make_mmio_spte > { > $vcpu = (struct kvm_vcpu *)arg0; > $nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious; > > if (($vcpu->stat.pf_taken / 5) > $nr_handled) { > printf("PID %u stuck, reached make_mmio_spte()", pid); > } > } > > kprobe:make_spte > { > $vcpu = (struct kvm_vcpu *)arg0; > $nr_handled = $vcpu->stat.pf_emulate + $vcpu->stat.pf_fixed + $vcpu->stat.pf_spurious; > > if (($vcpu->stat.pf_taken / 5) > $nr_handled) { > printf("PID %u stuck, reached make_spte()", pid); > } > } > >