> Yeesh. There is a ridiculous amount of potentially problematic activity. KSM is
> active in that trace, it looks like NUMA balancing might be in play,

Sorry about the delayed response - it seems like the majority of locked-up
guest VMs stop throwing repeated EPT_VIOLATIONs as soon as we turn
`numa_balancing` off. They still remain locked up, but that might be because
whatever originally caused the looping EPT_VIOLATIONs corrupted/crashed them
in an unrecoverable way (can you think of any way that might happen?).

----

We experimented with numa_balancing + transparent hugepage settings in certain
data centers (to determine whether those settings make the lockups disappear).
The incidence rate of locked-up guests has dropped significantly for the
numa_balancing=0, thp=1 case, but numa_balancing=0, thp=0 hosts are still
locking up / looping on EPT_VIOLATIONs at about the same rate as (or slightly
below) the case with both numa_balancing=1 and thp=1.

Here's a function_graph trace from a host with numa_balancing=0, thp=1, ksm=2
(KSM unloaded and unmerged after it was initially on):
https://transfer.sh/M4WdfxaTJs/ept-fn-graph.log

```
# bpftrace -e 'kprobe:handle_ept_violation { @ept[comm] = count(); } tracepoint:kvm:kvm_page_fault { @pf[comm] = count(); }'
Attaching 2 probes...
^C

@ept[CPU 0/KVM]: 52
@ept[CPU 3/KVM]: 61
@ept[CPU 2/KVM]: 112
@ept[CPU 1/KVM]: 257

@pf[CPU 0/KVM]: 52
@pf[CPU 3/KVM]: 61
@pf[CPU 2/KVM]: 111
@pf[CPU 1/KVM]: 262
```

> there might be hugepage shattering, etc.

Is there a BPF program / another way we can confirm this is the case? (A rough
sketch of what we had in mind is at the end of this mail.)

I think the fact that guests lock up at about the same rate with thp=0,
numa_balancing=0 as with thp=1, numa_balancing=1 is interesting and relevant.
thp=1, numa_balancing=0 is the only combination with noticeably fewer guests
locking up.

> Let me rephrase that statement: it rules out a certain class of memslot and
> mmu_notifier bugs, namely bugs where KVM would incorrectly leave an invalidation
> refcount (for lack of a better term) elevated. It doesn't mean memslot changes
> and/or mmu_notifier events aren't at fault.

I see, thanks!

> kernel bug, e.g. it's possible the vCPU is stuck purely because it's being thrashed
> to the point where it can't make forward progress.

Given that the guest stays locked up post-migration on a completely unloaded
host, I think this is unlikely, unless the thrashing somehow also corrupts the
guest's state before the migration?

> Yeah. Definitely not related to async page fault.

I guess the biggest lead currently is why `numa_balancing=1` increases the
odds of this issue occurring, and why it is specifically more likely with
transparent hugepages off (`thp=0`). To be clear, the lockups occur in every
configuration we've tried so far, so none of these settings are likely the
direct cause - they're just relevant factors.

If there are any changes to the kernel that might help illuminate the issue
further, we can run a custom kernel and migrate a guest to the modified host -
let me know if there's anything that might help!
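
P.S. On the hugepage-shattering question above, this is roughly what we were
thinking of running on an affected host - just a sketch, and the symbol names
are a guess for our kernel (newer kernels rename `split_huge_page_to_list`,
e.g. to `split_huge_page_to_list_to_order`, so the kprobes may need
adjusting):

```
# bpftrace -e '
  // Count THP shattering on the host: PMD splits and full compound-page
  // splits, keyed by the kernel stack that triggered them.
  // NOTE: these symbol names are assumptions about the running kernel.
  kprobe:__split_huge_pmd        { @pmd_split[kstack(5)] = count(); }
  kprobe:split_huge_page_to_list { @page_split[kstack(5)] = count(); }
  // Stop and dump the maps after 30 seconds.
  interval:s:30 { exit(); }
'
```

If those counts stay flat while a guest is looping on EPT_VIOLATIONs, we'd
take that as a sign that hugepage shattering isn't the trigger - does that
sound like a reasonable way to check, or is there a better signal to watch?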