> LOL, NUMA autobalancing. I have a longstanding hatred of that feature. I'm sure
> there are setups where it adds value, but from my perspective it's nothing but
> pain and misery.

Do you think autobalancing is increasing the odds of some edge-case race
condition, perhaps? I find it really curious that numa_balancing definitely
affects this issue, but particularly when thp=0. Is it just too many EPT
entries to install when transparent hugepages is disabled, increasing the
likelihood of a race condition / lock contention of some sort?

> > They still remain locked up, but that might be because the original cause of the
> > looping EPT_VIOLATIONs corrupted/crashed them in an unrecoverable way (are there
> > any ways you can think of that that might happen)?
>
> Define "remain locked up". If the vCPUs are actively running in the guest and
> making forward progress, i.e. not looping on VM-Exits on a single RIP, then they
> aren't stuck from KVM's perspective.

Right, the traces look like they're not stuck (i.e. no looping on the same
RIP). By "remain locked up" I mean that the VM is unresponsive on both the
console and services (such as ssh) used to connect to it.

> But that doesn't mean the guest didn't take punitive action when a vCPU was
> effectively stalled indefinitely by KVM, e.g. from the guest's perspective the
> stuck vCPU will likely manifest as a soft lockup, and that could lead to a panic()
> if the guest is a Linux kernel running with softlockup_panic=1.

So far we haven't had any guest kernels with softlockup_panic=1 hit this
issue, so it's hard to confirm, but it makes sense that the guest took
punitive action in response to being stalled.

Any thoughts on how we might reproduce the issue or track it down better?
Anything look suspect in the function_graph trace?

(Note that this was on a host that had numa_balancing=0,thp=1 from before the
guest booted, and it still ended up in the EPT_VIOLATION loop and "locked up"
(unresponsive on console).)
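
For reference, this is roughly how we've been flipping the two host knobs and
watching for the exit storm (a sketch, run as root on the host; assumes the
standard procfs/sysfs paths and that debugfs/ftrace with the kvm:kvm_exit
tracepoint is available — the trace line format varies by kernel version, so
the grep is only a coarse signal):

```shell
# Check current host settings
cat /proc/sys/kernel/numa_balancing
cat /sys/kernel/mm/transparent_hugepage/enabled

# A/B test: disable NUMA autobalancing and/or THP before booting the guest
echo 0 > /proc/sys/kernel/numa_balancing
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Sample KVM exits for 10s; a vCPU stuck in the loop shows up as a flood of
# EPT_VIOLATION exits, typically at the same guest RIP
cd /sys/kernel/debug/tracing
echo 1 > events/kvm/kvm_exit/enable
echo 1 > tracing_on
sleep 10
echo 0 > tracing_on
grep -c EPT_VIOLATION trace
```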