Re: Deadlock due to EPT_VIOLATION

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 22 Aug 2023 08:11:57 -0700

On Mon, Aug 21, 2023, Eric Wheeler wrote:
> On Mon, 21 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 21, 2023, Eric Wheeler wrote:
> > > On Fri, 18 Aug 2023, Eric Wheeler wrote:
> > > > On Fri, 18 Aug 2023, Sean Christopherson wrote:
> > > > > On Thu, Aug 17, 2023, Eric Wheeler wrote:
> > > > > > On Thu, 17 Aug 2023, Sean Christopherson wrote:
> > > To me, these are opaque numbers.  What do they represent?  What are you looking for in them?
> > 
> > inprog is '1' if there is an in-progress mmu_notifier invalidation at the time
> > of the EPT violation.  start/end are the range that is being invalidated _if_
> > there is an in-progress invalidation.  If a vCPU were stuck with inprog=1, then
> > the most likely scenario is that there's an unpaired invalidation, i.e. something
> > started an invalidation but never completed it.
> > 
> > seq is a sequence count that is incremented when an invalidation completes, e.g.
> > if a vCPU was stuck and seq were constantly changing, then it would mean that
> > the primary MMU is invalidating the same range over and over so quickly that the
> > vCPU can't make forward progress.
> 
> Here is another one, I think you described exactly this: the vcpu is 
> always the same, and the sequence increments, forever:
> 
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854227 inprog=1 start=7fa3183a3000 end=7fa3183a4000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854228 inprog=1 start=7fa3183a3000 end=7fa3183a4000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854229 inprog=1 start=7fa3183a4000 end=7fa3183a5000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422a inprog=1 start=7fa3183a4000 end=7fa3183a5000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422b inprog=1 start=7fa3183a8000 end=7fa3183a9000
>       2 ept[0] vcpu=ffff9964cdc48000 seq=8085422d inprog=1 start=7fa3183a9000 end=7fa3183aa000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=8085422e inprog=1 start=7fa3183a9000 end=7fa3183aa000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854232 inprog=1 start=7fa3183ac000 end=7fa3183ad000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854233 inprog=1 start=7fa3183ad000 end=7fa3183ae000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854235 inprog=1 start=7fa3183ae000 end=7fa3183af000
>       1 ept[0] vcpu=ffff9964cdc48000 seq=80854236 inprog=1 start=7fa3183ae000 end=7fa3183af000
> 
> Here is the whole log with 500,000+ lines over 5 minutes of recording, it 
> was first stuck on one vcpu for most of the time, and toward the end it 
> was stuck on a different VCPU:
> 
> The file starts with 555,596 occurances of vcpu=ffff9964cdc48000 and is 
> then followed by 31,784 occurances of vcpu=ffff9934ed50c680.  As you can 
> see in the file, they are not interleaved:
> 
> 	https://www.linuxglobal.com/out/handle_ept_violation.log2
> 
>   # awk '{print $3}' handle_ept_violation.log2 |uniq -c
>    555596 vcpu=ffff9964cdc48000
>     31784 vcpu=ffff9934ed50c680

Hrm, but the address range being invalidated is changing.  Without seeing the
guest RIP, or even a timestamp, it's impossible to tell if the vCPU is well and
truly stuck or if it's just getting thrashed so hard by NUMA balancing or KSM
that it looks stuck.

Drat.

> > Below is another bpftrace program that will hopefully shrink the 
> > haystack to the point where we can find something via code inspection.
> 
> Ok thanks, we'll give it a try.

Try this version instead.  It's more comprehensive and more precise, e.g. should
only trigger on the guest being 100% stuck, and also fixes a PID vs. TID goof.

Note!  Enable trace_kvm_exit before/when running this to ensure KVM grabs the guest RIP
from the VMCS.  Without that enabled, RIP from vcpu->arch.regs[16] may be stale.

struct kvm_page_fault {
	const u64 addr;
	const u32 error_code;
	const bool prefetch;

	const bool exec;
	const bool write;
	const bool present;
	const bool rsvd;
	const bool user;

	const bool is_tdp;
	const bool nx_huge_page_workaround_enabled;

	bool huge_page_disallowed;
	u8 max_level;

	u8 req_level;

	u8 goal_level;

	u64 gfn;

	struct kvm_memory_slot *slot;

	u64 pfn;
	unsigned long hva;
	bool map_writable;
};

kprobe:kvm_faultin_pfn
{
	$vcpu = (struct kvm_vcpu *)arg0;
	$kvm = $vcpu->kvm;
	$rip = $vcpu->arch.regs[16];

	if (@last_rip[tid] == $rip) {
		@same[tid]++
	} else {
		@same[tid] = 0;
	}
	@last_rip[tid] = $rip;

	if (@same[tid] > 1000) {
		$fault = (struct kvm_page_fault *)arg1;
		$hva = -1;
		$flags = 0;

		if ($fault->slot != 0) {
			$hva = $fault->slot->userspace_addr +
			       (($fault->gfn - $fault->slot->base_gfn) << 12);
			$flags = $fault->slot->flags;
		}

		printf("%s tid[%u] pid[%u] stuck @ rip %lx (%lu hits), gpa = %lx, hva = %lx : MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip, @same[tid], $fault->addr, $hva,
		       $kvm->mmu_invalidate_seq, $kvm->mmu_invalidate_in_progress,
		       $kvm->mmu_invalidate_range_start, $kvm->mmu_invalidate_range_end);
	}
}

kprobe:make_mmio_spte
{
        if (@same[tid] > 1000) {
	        $vcpu = (struct kvm_vcpu *)arg0;
		$rip = $vcpu->arch.regs[16];

		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_mmio_spte()\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
	}
}

kprobe:make_spte
{
        if (@same[tid] > 1000) {
	        $vcpu = (struct kvm_vcpu *)arg0;
		$rip = $vcpu->arch.regs[16];

		printf("%s tid[%u] pid[%u] stuck @ rip %lx made it to make_spte()\n",
		       strftime("%H:%M:%S:%f", nsecs), tid, pid, $rip);
	}
}