Re: Deadlock due to EPT_VIOLATION

Sean Christopherson <seanjc@xxxxxxxxxx> · Thu, 17 Aug 2023 11:21:03 -0700

On Wed, Aug 16, 2023, Eric Wheeler wrote:
> On Tue, 15 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > > a locked-up guest to it) or eBPF programs might help illuminate the issue
> > > > > further, let me know!
> > > > > 
> > > > > Thanks for all your help so far!
> > > > 
> > > > Since it sounds like you can test with a custom kernel, try running with this
> > > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.  The
> > > > below expands said tracepoint to capture information about mmu_notifiers and
> > > > memslots generation.  With luck, it will reveal a smoking gun.
> > > 
> > > Getting this patch into production systems is challenging, perhaps live
> > > patching is an option:
> > 
> > Ah, I take it that when you gathered information after a live migration, you were
> > migrating VMs into a sidecar environment.
> > 
> > > Questions:
> > > 
> > > 1. Do you know if this would be safe to insert as a live kernel patch?
> > 
> > Hmm, probably not safe.
> > 
> > > For example, does adding to TRACE_EVENT modify a struct (which is not
> > > live-patch-safe) or is it something that should plug in with simple
> > > function redirection?
> > 
> > Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> > 
> > Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> > unnecessarily abandoned that tactic. *sigh*
> > 
> > If your environment provides BTF info, then this bpftrace program should provide
> > the mmu_notifier half of the tracepoint hack-a-patch.  If this yields nothing
> > interesting then we can try diving into whether or not the mmu_root is stale, but
> > let's cross that bridge when we have to.
> > 
> > I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> > 
> > kprobe:handle_ept_violation
> > {
> > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > }
> > 
> > If you don't have BTF info, we can still use a bpf program, but to get at the
> > fields of interest, I think we'd have to resort to pointer arithmetic with struct
> > offsets grabbed from your build.
> 
> We have BTF, so hurray for not needing struct offsets!
> 
> I am testing this on a host that is not (yet) known to be stuck. Please do 
> a quick sanity check for me and make sure this looks like the kind of 
> output that you want to see:
> 
> I had to shrink the printf line because it was longer than 64 bytes. I put 
> the process ID as the first item and changed %lx to %08lx for visual 
> alignment. Aside from that, it is the same as what you provided.
> 
> We're piping it through `uniq -c` to only see interesting changes (and 
> show counts) because it is extremely noisy. If this looks good to you then 
> please confirm and I will run it on a production system after a lock-up:
> 
> 	kprobe:handle_ept_violation
> 	{
> 		printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
> 		       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> 		       arg0,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> 		       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> 	}
> 
> Questions:
>   - Should pid be zero?  (Note this is not yet running on a host with a 
>     locked-up guest, in case that is the reason.)

No.  I'm not at all familiar with PID management, I just copy+pasted from
pid_nr(), which is what KVM uses when displaying the pid in debugfs.  I printed
the PID purely to be able to unambiguously correlate prints to vCPUs without
needing to cross-reference kernel addresses.  I.e. having the PID makes life
easier, but it shouldn't be strictly necessary.

>   - Can you think of any reason that this would be unsafe? (Forgive my 
>     paranoia, but of course this will be running on a production
>     hypervisor.)

Printing the raw address of the vCPU structure will effectively neuter KASLR, but
KASLR isn't all that much of a barrier, and whoever has permission to load a BPF
program on the system can do far, far more damage.

>   - Can you think of any adjustments to the bpf script above before 
>     running this for real?

You could try and make it less noisy or more precise, e.g. by tailoring it to
print only information on the vCPU that is stuck.  If the noise isn't a problem
though, I would keep it as-is, the more information the better.

> Here is an example trace on a test host that isn't locked up:
> 
>  ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
>    1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>  215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>   66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> 18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000

Woah.  That's over 2 *billion* invalidations for a single VM.  Even if that's a
long-lived VM, that still seems rather insane.  E.g. if the uptime of that VM
*on that host* is 6 months, my back of the napkin math says that that's well over
100 invalidations every second for 6 months straight.

Bit 31 being set in relative isolation almost makes me wonder if mmu_invalidate_seq
got corrupted somehow.  Either that or you are thrashing that VM with a vengeance.
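
For reference, a quick sketch of the napkin math, using only the seq value
from the trace above.  The 6-month uptime is the hypothetical from the
paragraph above (approximated here as 180 days), not a measured number, so
treat the rate as an order-of-magnitude figure only.  It also splits the value
at bit 31 to show how little of the count sits below that bit.

```python
# mmu_invalidate_seq as printed in the trace (seq=8009c7eb).
seq = 0x8009c7eb

# Hypothetical uptime: 6 months, approximated as 180 days (an assumption).
seconds = 180 * 24 * 60 * 60

# Average invalidations per second if the counter really counted that high.
rate = seq / seconds

# Split the value at bit 31: is the bit set, and what is left below it?
bit31_set = bool(seq & (1 << 31))
low_bits = seq & ((1 << 31) - 1)

print(f"seq       = {seq:,}")
print(f"rate      = {rate:.0f} invalidations/sec over {seconds:,} seconds")
print(f"bit 31    = {bit31_set}")
print(f"remainder = {low_bits:,} (0x{low_bits:x})")
```

If bit 31 is masked off, only about 641 thousand invalidations remain, which
would be a far more believable count.  That is why the lone high bit looks
suspicious, though this alone doesn't prove corruption.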