Re: Deadlock due to EPT_VIOLATION

Amaan Cheval <amaan.cheval@xxxxxxxxx> · Fri, 21 Jul 2023 20:04:07 +0530

Hey Sean,

I'm helping Brian look into this issue.

> Would you be able to run a bpftrace program on a host with a stuck guest?  If so,
> I believe I could craft a program for the kvm_exit tracepoint that would rule out
> or confirm two of the three likely culprits.

Could you share your thoughts on what the 2-3 likely culprits might be, and the
bpftrace program if possible?

I ran one that just dumps all the args of the kvm_exit tracepoint on an affected
host. Here's a snippet:

```
# bpftrace -e 'tracepoint:kvm:kvm_exit { printf("%s: rip=%lx reason=%u isa=%u info1=%lx info2=%lx intr=%u error=%u vcpu=%u \n", comm, args->guest_rip, args->exit_reason, args->isa, args->info1, args->info2, args->intr_info, args->error_code, args->vcpu_id); }'

CPU 3/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=3
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5f8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa746d5fa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffffa7b88eaa reason=12 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 3/KVM: rip=7ff4543b74e4 reason=1 isa=1 info1=0 info2=0 intr=2147483884 error=0 vcpu=3
CPU 0/KVM: rip=ffffffff94f3ff15 reason=1 isa=1 info1=0 info2=0 intr=2147483884 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff94e683a8 reason=32 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff94e683aa reason=1 isa=1 info1=0 info2=0 intr=2147483894 error=0 vcpu=0
CPU 0/KVM: rip=ffffffff95516005 reason=12 isa=1 info1=0 info2=0 intr=0 error=0 vcpu=0
CPU 3/KVM: rip=7ff45260dd24 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260dd24 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260df88 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
CPU 3/KVM: rip=7ff45260df88 reason=48 isa=1 info1=181 info2=0 intr=0 error=0 vcpu=3
```
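
To make the snippet above easier to read, here is a minimal Python sketch that decodes the numeric fields. It is not part of the original tooling. The helper names (`decode_intr_info`, `decode_ept_qual`, `summarize`) are made up for illustration. The bit layouts are the VMX ones from the Intel SDM, and the sketch assumes `info1` on an EPT violation carries the exit qualification. Note that the bpftrace format string prints `rip`, `info1` and `info2` in hex (`%lx`) and everything else in decimal (`%u`).

```python
import re
from collections import Counter

# VMX basic exit reasons, limited to the ones that appear in the snippet.
EXIT_REASONS = {
    1: "EXTERNAL_INTERRUPT",
    12: "HLT",
    32: "MSR_WRITE",
    48: "EPT_VIOLATION",
}

# Matches one line of the printf output above; rip/info1 are hex, the rest decimal.
LINE_RE = re.compile(
    r"rip=(?P<rip>[0-9a-f]+) reason=(?P<reason>\d+) .*?"
    r"info1=(?P<info1>[0-9a-f]+) .*?intr=(?P<intr>\d+) .*?vcpu=(?P<vcpu>\d+)"
)


def decode_intr_info(intr):
    """Split VM-exit interruption info into (valid, type, vector)."""
    valid = bool((intr >> 31) & 1)   # bit 31: field is valid
    intr_type = (intr >> 8) & 0x7    # bits 10:8: 0 = external interrupt
    vector = intr & 0xFF             # bits 7:0: vector number
    return valid, intr_type, vector


def decode_ept_qual(qual):
    """Decode the low bits of an EPT-violation exit qualification."""
    return {
        "read": bool(qual & 0x001),               # access was a data read
        "write": bool(qual & 0x002),              # access was a data write
        "fetch": bool(qual & 0x004),              # access was an instruction fetch
        "gpa_readable": bool(qual & 0x008),       # EPT allowed reads of the GPA
        "gpa_writable": bool(qual & 0x010),       # EPT allowed writes of the GPA
        "gpa_executable": bool(qual & 0x020),     # EPT allowed execution of the GPA
        "gla_valid": bool(qual & 0x080),          # guest linear address is valid
        "final_translation": bool(qual & 0x100),  # only meaningful if gla_valid
    }


def summarize(lines):
    """Count exits per (vcpu, exit reason name) over trace lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip lines that are not kvm_exit records
        reason = int(m["reason"])
        counts[(int(m["vcpu"]), EXIT_REASONS.get(reason, str(reason)))] += 1
    return counts
```

Read this way, `intr=2147483894` is `0x800000f6` (valid, external interrupt, vector `0xf6`) and `intr=2147483884` is `0x800000ec` (vector `0xec`). For the `reason=48` exits, `info1=181` is `0x181`: a data read through a valid guest linear address, on the final translation rather than a page-table walk, with none of the read/write/execute permission bits set for that guest-physical address.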

I've also run a `function_graph` trace on some of the affected hosts. I can
share it if you think it would help to see what the host kernel is doing while
the guests are looping on EPT_VIOLATIONs. Nothing obvious stands out to me
right now.

We suspected KSM briefly, but ruled that out by turning KSM off and unmerging
KSM pages. After doing that, a guest VM still locked up and started looping on
EPT_VIOLATIONs (like in Brian's original email), so it's unlikely this is
KSM-specific.
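
For anyone repeating the KSM check, a small Python sketch of how the "off and unmerged" state could be confirmed from sysfs. The function names are hypothetical. The sketch assumes the standard KSM sysfs files (`run`, `pages_shared`, `pages_sharing`) under `/sys/kernel/mm/ksm`, where writing `2` to `run` stops ksmd and unmerges all merged pages.

```python
from pathlib import Path


def ksm_state(sysfs_dir="/sys/kernel/mm/ksm"):
    """Read the KSM run mode and shared-page counters as integers."""
    base = Path(sysfs_dir)
    names = ("run", "pages_shared", "pages_sharing")
    return {name: int((base / name).read_text()) for name in names}


def ksm_fully_unmerged(state):
    """True if ksmd is not running and no merged pages remain."""
    # run: 0 = stopped, 1 = running, 2 = stopped and pages unmerged.
    stopped = state["run"] in (0, 2)
    return stopped and state["pages_shared"] == 0 and state["pages_sharing"] == 0
```

Calling `ksm_fully_unmerged(ksm_state())` on the host before reproducing the lockup would show whether any merged pages were still around at the time.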

Another interesting observation: when we migrate a locked-up guest to a
different host, the guest _stays_ locked up and keeps throwing EPT violations on
the new host as well. Since we see this across guest operating systems, it's
unlikely the issue is in the guest kernel itself. Perhaps the host kernel is
corrupting the guest's state in a way that keeps it locked up even after
migration?

If you have any thoughts on anything else to try, let me know!