> Yeesh. There is a ridiculous amount of potentially problematic activity. KSM is
> active in that trace, it looks like NUMA balancing might be in play,

Sorry about the delayed response - it seems like the majority of locked-up
guest VMs stop throwing repeated EPT_VIOLATIONs as soon as we turn
`numa_balancing` off. They still remain locked up, but that might be because
whatever originally caused the looping EPT_VIOLATIONs corrupted/crashed them
in an unrecoverable way (can you think of any way that might happen?).

----

We experimented with numa_balancing + transparent hugepage settings in certain
data centers (to determine whether those settings make the lockups disappear).
The incidence rate of locked-up guests has dropped significantly for the
numa_balancing=0, thp=1 case, but numa_balancing=0, thp=0 hosts are still
locking up / looping on EPT_VIOLATIONs at about the same rate as (or slightly
below) the case with both numa_balancing=1 and thp=1.

Here's a function_graph trace from a host with numa_balancing=0, thp=1, ksm=2
(KSM unloaded and unmerged after it was initially on):
https://transfer.sh/M4WdfxaTJs/ept-fn-graph.log

```
# bpftrace -e 'kprobe:handle_ept_violation { @ept[comm] = count(); } tracepoint:kvm:kvm_page_fault { @pf[comm] = count(); }'
Attaching 2 probes...
^C

@ept[CPU 0/KVM]: 52
@ept[CPU 3/KVM]: 61
@ept[CPU 2/KVM]: 112
@ept[CPU 1/KVM]: 257

@pf[CPU 0/KVM]: 52
@pf[CPU 3/KVM]: 61
@pf[CPU 2/KVM]: 111
@pf[CPU 1/KVM]: 262
```

> there might be hugepage shattering, etc.

Is there a BPF program / another way we can confirm this is the case? (A rough
sketch of what we had in mind is at the end of this mail.)

I think the fact that guests lock up at about the same rate with thp=0,
numa_balancing=0 as with thp=1, numa_balancing=1 is interesting and relevant.
thp=1, numa_balancing=0 is the only combination with noticeably fewer guests
locking up.

> Let me rephrase that statement: it rules out a certain class of memslot and
> mmu_notifier bugs, namely bugs where KVM would incorrectly leave an invalidation
> refcount (for lack of a better term) elevated. It doesn't mean memslot changes
> and/or mmu_notifier events aren't at fault.

I see, thanks!

> kernel bug, e.g. it's possible the vCPU is stuck purely because it's being thrashed
> to the point where it can't make forward progress.

Given that the guest stays locked up post-migration on a completely unloaded
host, I think this is unlikely, unless the thrashing somehow also corrupts the
guest's state before the migration?

> Yeah. Definitely not related to async page fault.

I guess the biggest lead currently is why `numa_balancing=1` increases the
odds of this issue occurring, and why it is specifically more likely with
transparent hugepages off (`thp=0`). To be clear, the lockups occur in every
configuration we've tried so far, so none of these settings are likely the
direct cause - they're just relevant factors.

If there are any changes to the kernel that might help illuminate the issue
further, we can run a custom kernel and migrate a guest to the modified host -
let me know if there's anything that might help!
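
P.S. On the hugepage-shattering question above, this is roughly what we were
thinking of running on an affected host - just a sketch, and the symbol names
are a guess for our kernel (newer kernels rename `split_huge_page_to_list`,
e.g. to `split_huge_page_to_list_to_order`, so the kprobes may need
adjusting):

```
# bpftrace -e '
  // Count THP shattering on the host: PMD splits and full compound-page
  // splits, keyed by the kernel stack that triggered them.
  // NOTE: these symbol names are assumptions about the running kernel.
  kprobe:__split_huge_pmd        { @pmd_split[kstack(5)] = count(); }
  kprobe:split_huge_page_to_list { @page_split[kstack(5)] = count(); }
  // Stop and dump the maps after 30 seconds.
  interval:s:30 { exit(); }
'
```

If those counts stay flat while a guest is looping on EPT_VIOLATIONs, we'd
take that as a sign that hugepage shattering isn't the trigger - does that
sound like a reasonable way to check, or is there a better signal to watch?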