On Thu, 17 Aug 2023, Sean Christopherson wrote:
> On Wed, Aug 16, 2023, Eric Wheeler wrote:
> > On Tue, 15 Aug 2023, Sean Christopherson wrote:
> > > On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > > > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > > > If you have any suggestions on modifying the host kernel (and then
> > > > > > migrating a locked-up guest to it), or on eBPF programs that might
> > > > > > help illuminate the issue further, let me know!
> > > > > >
> > > > > > Thanks for all your help so far!
> > > > >
> > > > > Since it sounds like you can test with a custom kernel, try running with
> > > > > this patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> > > > > stuck.  The below expands said tracepoint to capture information about
> > > > > mmu_notifiers and memslots generation.  With luck, it will reveal a
> > > > > smoking gun.
> > > >
> > > > Getting this patch into production systems is challenging; perhaps live
> > > > patching is an option:
> > >
> > > Ah, I take it that when you gathered information after a live migration,
> > > you were migrating VMs into a sidecar environment.
> > >
> > > > Questions:
> > > >
> > > > 1. Do you know if this would be safe to insert as a live kernel patch?
> > >
> > > Hmm, probably not safe.
> > >
> > > > For example, does adding to TRACE_EVENT modify a struct (which is not
> > > > live-patch-safe), or is it something that should plug in with simple
> > > > function redirection?
> > >
> > > Yes, the tracepoint defines a struct, e.g. in this case
> > > trace_event_raw_kvm_page_fault.
> > >
> > > Looking back, I think I misinterpreted an earlier response regarding
> > > bpftrace and unnecessarily abandoned that tactic.  *sigh*
> > >
> > > If your environment provides BTF info, then this bpftrace program should
> > > provide the mmu_notifier half of the tracepoint hack-a-patch.  If this
> > > yields nothing interesting then we can try diving into whether or not the
> > > mmu_root is stale, but let's cross that bridge when we have to.
> > >
> > > I recommend loading this only when you have a stuck vCPU; it'll be quite
> > > noisy.
> > >
> > > kprobe:handle_ept_violation
> > > {
> > > 	printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> > > 	       arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > > }
> > >
> > > If you don't have BTF info, we can still use a bpf program, but to get at
> > > the fields of interest, I think we'd have to resort to pointer arithmetic
> > > with struct offsets grabbed from your build.
> >
> > We have BTF, so hurray for not needing struct offsets!

Well, I was partly right: not all hosts have BTF.  What is involved in doing
this with struct offsets for Linux v6.1.x?  (A rough sketch of what I have in
mind is appended at the end of this mail, below my signature.)

> > I am testing this on a host that is not (yet) known to be stuck.  Please do
> > a quick sanity check for me and make sure this looks like the kind of
> > output that you want to see:
> >
> > I had to shrink the printf format string because it was longer than 64
> > bytes.  I put the process ID as the first item and changed %lx to %08lx for
> > visual alignment.  Aside from that, it is the same as what you provided.
> >
> > We're piping it through `uniq -c` to only see interesting changes (and
> > show counts) because it is extremely noisy.
> > If this looks good to you then please confirm and I will run it on a
> > production system after a lock-up:
> >
> > kprobe:handle_ept_violation
> > {
> > 	printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
> > 	       ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> > 	       arg0,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> > 	       ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > }
> >
> > Questions:
> >
> > - Should pid be zero?  (Note this is not yet running on a host with a
> >   locked-up guest, in case that is the reason.)
>
> No.  I'm not at all familiar with PID management, I just copy+pasted from
> pid_nr(), which is what KVM uses when displaying the pid in debugfs.  I
> printed the PID purely to be able to unambiguously correlate prints to vCPUs
> without needing to cross-reference kernel addresses.  I.e. having the PID
> makes life easier, but it shouldn't be strictly necessary.

Ok.

> > - Can you think of any reason that this would be unsafe?  (Forgive my
> >   paranoia, but of course this will be running on a production
> >   hypervisor.)
>
> Printing the raw address of the vCPU structure will effectively neuter KASLR,
> but KASLR isn't all that much of a barrier, and whoever has permission to
> load a BPF program on the system can do far, far more damage.

Agreed.

> > - Can you think of any adjustments to the bpf script above before
> >   running this for real?
>
> You could try and make it less noisy or more precise, e.g. by tailoring it to
> print only information on the vCPU that is stuck.  If the noise isn't a
> problem though, I would keep it as-is; the more information the better.

Ok, I will leave it as-is.

> > Here is an example trace on a test host that isn't locked up:
> >
> > ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
> >     1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> >   215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> >    66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> > 18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>
> Woah.  That's over 2 *billion* invalidations for a single VM.  Even if that's
> a long-lived VM, that still seems rather insane.  E.g. if the uptime of that
> VM *on that host* is 6 months, my back-of-the-napkin math says that's nearly
> 100 invalidations every second for 6 months straight.
>
> Bit 31 being set in relative isolation almost makes me wonder if
> mmu_invalidate_seq got corrupted somehow.  Either that or you are thrashing
> that VM with a vengeance.

Not sure what is happening on that host, but it could be getting thrashed by
another dev trying to reproduce the bug for a bisect; we don't have a
reproducer yet...

--
Eric Wheeler
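P.S. Regarding the no-BTF hosts: here is roughly what I imagine the struct-
offset variant would look like.  This is only a sketch under my own
assumptions, not something I have run yet: the offsets below are placeholders
(deliberately 0x0) and would need to be replaced with values pulled from the
matching v6.1 build, e.g. with pahole against a vmlinux with debuginfo.  The
field chain just mirrors the BTF version above (vcpu->kvm->mmu_invalidate_*),
and I dropped the pid field to keep the pointer chasing to a minimum.

kprobe:handle_ept_violation
{
	/*
	 * PLACEHOLDER offsets -- replace with values from the running
	 * kernel's debuginfo, e.g.:
	 *   pahole -C kvm_vcpu vmlinux   -> offset of ->kvm
	 *   pahole -C kvm      vmlinux   -> offsets of mmu_invalidate_*
	 */
	$kvm_off    = (uint64)0x0;	/* offsetof(struct kvm_vcpu, kvm) */
	$seq_off    = (uint64)0x0;	/* offsetof(struct kvm, mmu_invalidate_seq) */
	$inprog_off = (uint64)0x0;	/* offsetof(struct kvm, mmu_invalidate_in_progress) */
	$start_off  = (uint64)0x0;	/* offsetof(struct kvm, mmu_invalidate_range_start) */
	$end_off    = (uint64)0x0;	/* offsetof(struct kvm, mmu_invalidate_range_end) */

	$vcpu = arg0;
	$kvm  = *(uint64 *)($vcpu + $kvm_off);	/* read vcpu->kvm pointer */

	printf("ept vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
	       $vcpu,
	       *(uint64 *)($kvm + $seq_off),
	       *(uint64 *)($kvm + $inprog_off),
	       *(uint64 *)($kvm + $start_off),
	       *(uint64 *)($kvm + $end_off));
}

If that looks sane, the same `grep ^ept --line-buffered | uniq -c` pipeline we
use above should work unchanged, since the output still starts with "ept".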