On Wed, Aug 02, 2023, Amaan Cheval wrote:
> > LOL, NUMA autobalancing. I have a longstanding hatred of that feature. I'm sure
> > there are setups where it adds value, but from my perspective it's nothing but
> > pain and misery.
>
> Do you think autobalancing is increasing the odds of some edge-case race
> condition, perhaps?
>
> I find it really curious that numa_balancing definitely affects this issue, but
> particularly when thp=0. Is it just too many EPT entries to install
> when transparent hugepages is disabled, increasing the likelihood of
> a race condition / lock contention of some sort?

NUMA balancing works by zapping PTEs[*] in userspace page tables for mappings to
remote memory, and then migrating the data to local memory on the resulting page
fault. When that memory is being used to back a KVM guest, zapping the userspace
(primary) PTEs triggers an mmu_notifier event that in turn zaps KVM's PTEs,
a.k.a. SPTEs (which used to mean Shadow PTEs, but we're retroactively redefining
SPTE to also mean Secondary PTEs so that it's correct when shadow paging isn't
being used).

If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
mmu_notifier events could theoretically stall a vCPU indefinitely.

The reason I dislike NUMA balancing is that it's all too easy to end up with
subtle bugs and/or misconfigured setups where the NUMA balancing logic zaps
PTEs/SPTEs without actually being able to move the page in the end, i.e. it's
(IMO) too easy for NUMA balancing to get false positives when determining
whether or not to try and migrate a page.

That said, it's definitely very unexpected that NUMA balancing would be zapping
SPTEs to the point where a vCPU can't make forward progress. It's theoretically
possible that that's what's happening, but quite unlikely, especially since it
sounds like you're seeing issues even with NUMA balancing disabled. More likely
is that there is a bug somewhere that results in the mmu_notifier event refcount
staying incorrectly elevated, but that type of bug shouldn't follow the VM
across a live migration...

[*] Not technically a full zap of the PTE, it's just marked PROT_NONE, i.e.
    !PRESENT, but on the KVM side of things it does manifest as a full zap of
    the SPTE.
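In case it helps, the fault-vs-invalidation interplay looks roughly like this
(a very rough sketch, NOT literal or compilable kernel code; mmu_invalidate_retry(),
kvm_unmap_gfn_range(), RET_PF_RETRY and the mmu_invalidate_{seq,in_progress}
fields are the real upstream names, the surrounding helpers are invented here
purely for illustration):

  /* mmu_notifier side: runs when NUMA balancing PROT_NONEs primary PTEs. */
  void invalidate_range_start(struct kvm *kvm, gfn_t start, gfn_t end)
  {
          kvm->mmu_invalidate_in_progress++;      /* fail faults until _end() */
          kvm_unmap_gfn_range(kvm, ...);          /* zap the SPTEs */
  }

  void invalidate_range_end(struct kvm *kvm)
  {
          kvm->mmu_invalidate_seq++;              /* fail faults that raced */
          kvm->mmu_invalidate_in_progress--;
  }

  /* vCPU side: handling the resulting EPT violation. */
  int handle_ept_violation(struct kvm_vcpu *vcpu, gfn_t gfn)
  {
          unsigned long seq = vcpu->kvm->mmu_invalidate_seq;
          kvm_pfn_t pfn = faultin_pfn(vcpu, gfn); /* resolve via primary MMU */

          /*
           * If an invalidation is in flight, or one completed after 'seq'
           * was snapshotted, the pfn may already be stale: drop it and
           * retry the fault from scratch.
           */
          if (mmu_invalidate_retry(vcpu->kvm, seq))
                  return RET_PF_RETRY;

          return install_spte(vcpu, gfn, pfn);    /* map it into EPT */
  }

The key point is the snapshot-and-recheck: a vCPU that keeps losing that race
just keeps retrying the same EPT violation, which is how a constant stream of
invalidations can stall it.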
> > > They still remain locked up, but that might be because the original cause of the
> > > looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> > > any ways you can think of that that might happen)?
> >
> > Define "remain locked up". If the vCPUs are actively running in the guest and
> > making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
> > aren't stuck from KVM's perspective.
>
> Right, the traces look like they're not stuck (i.e. no looping on the same
> RIP). By "remain locked up" I mean that the VM is unresponsive on both the
> console and services (such as ssh) used to connect to it.
>
> > But that doesn't mean the guest didn't take punitive action when a vCPU was
> > effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
> > stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
> > if the guest is a Linux kernel running with softlockup_panic=1.
>
> So far we haven't had any guest kernels with softlockup_panic=1 have this issue,
> so it's hard to confirm, but it makes sense that the guest took punitive action
> in response to being stalled.
>
> Any thoughts on how we might reproduce the issue or trace it down better?

Before going further, can you confirm that this earlier statement is correct?

 : Another interesting observation we made was that when we migrate a guest to a
 : different host, the guest _stays_ locked up and throws EPT violations on the new
 : host as well

Specifically, after migration, is the vCPU still fully stuck on EPT violations,
i.e. not making forward progress from KVM's perspective? Or is the guest "stuck"
after migration purely because the guest itself gave up?
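If you want to double check that from the host side, something like this crude
snippet can flag a vCPU that keeps re-faulting on one RIP (illustrative only;
it assumes tracefs is mounted at /sys/kernel/tracing, that kvm:kvm_exit has
been enabled via events/kvm/kvm_exit/enable, and it makes no attempt to
separate interleaved vCPUs):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
          FILE *f = fopen("/sys/kernel/tracing/trace_pipe", "r");
          char line[1024];
          unsigned long rip, last_rip = 0, repeats = 0;

          if (!f) {
                  perror("trace_pipe");
                  return 1;
          }

          while (fgets(line, sizeof(line), f)) {
                  char *p;

                  /* Only look at EPT violation exits. */
                  if (!strstr(line, "EPT_VIOLATION"))
                          continue;

                  /* kvm_exit records include "rip 0x<guest rip>". */
                  p = strstr(line, "rip 0x");
                  if (!p)
                          continue;

                  rip = strtoul(p + 4, NULL, 16);
                  if (rip == last_rip) {
                          /* Same RIP over and over => stuck in KVM, not
                           * merely unresponsive inside the guest. */
                          if (++repeats % 10000 == 0)
                                  printf("rip 0x%lx repeated %lu times\n",
                                         rip, repeats);
                  } else {
                          last_rip = rip;
                          repeats = 0;
                  }
          }

          fclose(f);
          return 0;
  }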