> LOL, NUMA autobalancing. I have a longstanding hatred of that feature. I'm sure
> there are setups where it adds value, but from my perspective it's nothing but
> pain and misery.

Do you think autobalancing is increasing the odds of some edge-case race
condition, perhaps? I find it really curious that numa_balancing definitely
affects this issue, but particularly when thp=0. Is it just too many EPT
entries to install when transparent hugepages is disabled, increasing the
likelihood of a race condition / lock contention of some sort?

> > They still remain locked up, but that might be because the original cause of the
> > looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> > any ways you can think of that that might happen)?
>
> Define "remain locked up". If the vCPUs are actively running in the guest and
> making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
> aren't stuck from KVM's perspective.

Right, the traces look like they're not stuck (i.e. no looping on the same
RIP). By "remain locked up" I mean that the VM is unresponsive on both the
console and services (such as ssh) used to connect to it.

> But that doesn't mean the guest didn't take punitive action when a vCPU was
> effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
> stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
> if the guest is a Linux kernel running with softlockup_panic=1.

So far we haven't had any guest kernels with softlockup_panic=1 hit this
issue, so it's hard to confirm, but it makes sense that the guest took
punitive action in response to being stalled.

Any thoughts on how we might reproduce the issue or track it down better?
Anything look suspect in the function_graph trace?

(Note that this was on a host that had numa_balancing=0,thp=1 from before the
guest booted, and it still ended up in the EPT_VIOLATION loop and "locked up"
(unresponsive on console).)
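
For reference, this is roughly how we've been flipping the two host knobs and
watching for the exit storm (a sketch, run as root on the host; assumes the
standard procfs/sysfs paths and that debugfs/ftrace with the kvm:kvm_exit
tracepoint is available — the trace line format varies by kernel version, so
the grep is only a coarse signal):

```shell
# Check current host settings
cat /proc/sys/kernel/numa_balancing
cat /sys/kernel/mm/transparent_hugepage/enabled

# A/B test: disable NUMA autobalancing and/or THP before booting the guest
echo 0 > /proc/sys/kernel/numa_balancing
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Sample KVM exits for 10s; a vCPU stuck in the loop shows up as a flood of
# EPT_VIOLATION exits, typically at the same guest RIP
cd /sys/kernel/debug/tracing
echo 1 > events/kvm/kvm_exit/enable
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
grep -c EPT_VIOLATION trace
```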