Re: Deadlock due to EPT_VIOLATION

Eric Wheeler <kvm@xxxxxxxxxxxxxxxxxx> · Wed, 16 Aug 2023 16:54:50 -0700 (PDT)

On Tue, 15 Aug 2023, Sean Christopherson wrote:
> On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > > further, let me know!
> > > > 
> > > > Thanks for all your help so far!
> > > 
> > > Since it sounds like you can test with a custom kernel, try running with this
> > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > > below expands said tracepoint to capture information about mmu_notifiers and
> > > memslots generation.  With luck, it will reveal a smoking gun.
> > 
> > Getting this patch into production systems is challenging, perhaps live
> > patching is an option:
> 
> Ah, I take it that when you gathered information after a live migration you were
> migrating VMs into a sidecar environment.
> 
> > Questions:
> > 
> > 1. Do you know if this would be safe to insert as a live kernel patch?
> 
> Hmm, probably not safe.
> 
> > For example, does adding to TRACE_EVENT modify a struct (which is not
> > live-patch-safe) or is it something that should plug in with simple
> > function redirection?
> 
> Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> 
> Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> unnecessarily abandoned that tactic. *sigh*
> 
> If your environment provides btf info, then this bpftrace program should provide
> the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
> interesting then we can try diving into whether or not the mmu_root is stale, but
> let's cross that bridge when we have to.
> 
> I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> 
> kprobe:handle_ept_violation
> {
> 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> }
> 
> If you don't have BTF info, we can still use a bpf program, but to get at the
> fields of interest, I think we'd have to resort to pointer arithmetic with struct
> offsets grabbed from your build.

We have BTF, so hurray for not needing struct offsets!

I am testing this on a host that is not (yet) known to be stuck. Please do 
a quick sanity check for me and make sure this looks like the kind of 
output that you want to see:

I had to shrink the printf line because it was longer than 64 bytes. I put 
the process ID as the first item and changed %lx to %08lx for visual 
alignment. Aside from that, it is the same as what you provided.

We're piping it through `uniq -c` to only see interesting changes (and 
show counts) because it is extremely noisy. If this looks good to you then 
please confirm and I will run it on a production system after a lock-up:

	kprobe:handle_ept_violation
	{
		printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
		       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
		       arg0,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
	}

Questions:
  - Should pid be zero?  (Note this is not yet running on a host with a 
    locked-up guest, in case that is the reason.)

  - Can you think of any reason that this would be unsafe? (Forgive my 
    paranoia, but of course this will be running on a production
    hypervisor.)

  - Can you think of any adjustments to the bpf script above before 
    running this for real?

Here is an example trace on a test host that isn't locked up:

 ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
   1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
 215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
  66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
     30 ept[0] vcpu=ffff96955de90000 seq=001fa362 inprog=0 start=7fa25ef0f000 end=7fa25ef10000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa44e inprog=0 start=7fa23f789000 end=7fa23f78a000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa59f inprog=0 start=7fa23dfe8000 end=7fa23dfe9000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5a0 inprog=0 start=7fa23b723000 end=7fa23b724000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a1 inprog=0 start=7fa238d50000 end=7fa238d51000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a5 inprog=0 start=7fa24d920000 end=7fa24d921000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5a6 inprog=0 start=7fa238a73000 end=7fa238a74000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5ea inprog=0 start=7fa244791000 end=7fa244792000
      1 ept[0] vcpu=ffff96955de92340 seq=001fa5eb inprog=0 start=7fa24c988000 end=7fa24c989000
      3 ept[0] vcpu=ffff96955de92340 seq=001fa5ec inprog=0 start=7fa23f78b000 end=7fa23f78c000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5ed inprog=0 start=7fa24256a000 end=7fa24256b000
      2 ept[0] vcpu=ffff96955de92340 seq=001fa5ee inprog=0 start=7fa24ed2b000 end=7fa24ed2c000
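
A note on reading the counts above: `uniq -c` merges only adjacent identical
lines, which is why the two vCPUs in the first four rows alternate instead of
collapsing into one row each. If a per-line total is more useful, a minimal
sketch for post-processing a saved capture follows. The helper name `collapse`
and the sample lines are made up for illustration, and since `sort` needs the
whole input this is meant for a finished capture, not the live stream:

```shell
#!/bin/sh
# collapse: count every distinct line on stdin, adjacent or not.
# (Hypothetical helper name; not part of bpftrace or the pipeline above.)
collapse() {
	# sort groups identical lines together so uniq -c can merge them;
	# awk '{$1=$1; print}' only strips uniq's leading padding.
	sort | uniq -c | awk '{$1=$1; print}'
}

# Made-up sample shaped like the trace lines, two vCPUs interleaved.
sample='ept[0] vcpu=A seq=1
ept[0] vcpu=B seq=1
ept[0] vcpu=A seq=1'

printf '%s\n' "$sample" | uniq -c    # three rows: vcpu=A is split in two
printf '%s\n' "$sample" | collapse   # two rows: one per distinct line
```

For example, after redirecting the bpftrace output to a file (say ept.log, a
hypothetical name), `collapse < ept.log` would give one count per distinct
line. Note that this loses the ordering of changes, so the plain `uniq -c`
view is still the better one for watching seq advance over time.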