Hey Sean,

> If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting
> mmu_notifier events could theoretically stall a vCPU indefinitely. The reason I
> dislike NUMA balancing is that it's all too easy to end up with subtle bugs
> and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without
> actually being able to move the page in the end, i.e. it's (IMO) too easy for
> NUMA balancing to get false positives when determining whether or not to try and
> migrate a page.

What are some situations where it might not be able to move the page in the end?

> That said, it's definitely very unexpected that NUMA balancing would be zapping
> SPTEs to the point where a vCPU can't make forward progress. It's theoretically
> possible that that's what's happening, but quite unlikely, especially since it
> sounds like you're seeing issues even with NUMA balancing disabled.

Yep, we're definitely seeing the issue occur even with numa_balancing disabled,
but it has become significantly less likely since we disabled numa_balancing.

> More likely is that there is a bug somewhere that results in the mmu_notifier
> event refcount staying incorrectly elevated, but that type of bug shouldn't follow
> the VM across a live migration...

Good news! We managed to live migrate a guest and that did "fix it".

Before the migration, the console was locked up on the login screen for about
6.5 hours, with the vCPU looping on EPT_VIOLATIONs. After the migration, we saw
`rcu_sched detected stalls on CPUs/tasks` on the console, and then the VM
resumed normal operation.

Here's a screenshot of the console (it was "locked up"/frozen on the login
screen until the migration):
https://i.imgur.com/n6CSsAv.png

> [*] Not technically a full zap of the PTE, it's just marked PROT_NONE, i.e.
> !PRESENT, but on the KVM side of things it does manifest as a full zap of the
> SPTE.

Thank you so much for that detailed explanation!

A colleague also modified a host kernel with KFI (Kernel Function
Instrumentation) and wrote a kernel module that intercepts the vmexit handler,
handle_ept_violation, does an EPT walk for the faulting GPA, and compares the
pfn at each level against /proc/iomem.

Assuming the EPT walking code is correct, we see this surprising result of a
PDPTE with pfn=0:

```
[15295.792019] kvm-kfi: enter: handle_ept_violation
[15295.792021] kvm-kfi: ept walk: eptp=0x103aaa05e gpa=0x792d4ff8
[15295.792023] PML4E : [0x103aaa05e] pfn=0x103aaa : is within the range: 0x100000-0x3fffffff: System RAM
[15295.792026] PDPTE : [0x0] pfn=0x0 : is within the range: 0x0-0xfff: Reserved
[15295.792029] PDE : [0xf000eef3f000e2c3] pfn=0xeef3f000e [large] : is within the range: 0x100000000-0x1075ffffff: System RAM
```

For comparison, the same module's output on a host without any "locked up"
guests:

```
[13956.578732] kvm-kfi: ept walk: eptp=0x1061b505e gpa=0xfcf28
[13956.578733] PML4E : [0x1061b505e] pfn=0x1061b5 : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578736] PDPTE : [0x11f29a907] pfn=0x11f29a : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578739] PDE : [0x11c205907] pfn=0x11c205 : is within the range: 0x100000-0x3fffffff: System RAM
[13956.578741] PTE : [0x11c204907] pfn=0x11c204 : is within the range: 0x100000-0x3fffffff: System RAM
```

Does this seem to indicate an mmu_notifier refcount issue to you, given that
migration did fix it? Is there any way to verify that?

We haven't found any guests with `softlockup_panic=1` yet, and since we can't
reproduce the issue on demand, we might have to wait a bit. That said, I
imagine the fact that live migration "fixed" the locked-up guest suggests that
the other guests that didn't get "fixed" were likely soft-locked because of the
vCPU stalling?

If you have any suggestions for host kernel modifications (we could then
migrate a locked-up guest to that host) or eBPF programs that might help
illuminate the issue further, let me know!

Thanks for all your help so far!
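
P.S. In case it helps with sanity-checking the walker, here is a rough sketch (Python, not the module's actual code) of how I understand a 4-level EPT walk is supposed to index the GPA and decode each entry. The helper names are made up for illustration, and the bit positions are my reading of the Intel SDM's EPT entry format. It ignores mode-based execute control, where bit 10 also counts towards "present". One thing it suggests: an entry with R/W/X all clear, like the PDPTE value 0x0 above, is not present, so a walker would normally stop there. If ours kept descending, the PDE line may just be whatever sits in physical page 0 rather than a real page directory entry.

```python
# Illustrative sketch only; helper names are hypothetical, not from the module.
PAGE_SHIFT = 12
INDEX_BITS = 9                       # 512 entries per paging-structure table
INDEX_MASK = (1 << INDEX_BITS) - 1
ADDR_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)  # bits 51:12
EPT_RWX = 0x7                        # bit 0 read, bit 1 write, bit 2 execute
EPT_LARGE = 1 << 7                   # set in a PDPTE/PDE mapping a 1G/2M page


def ept_indices(gpa):
    """Split a guest-physical address into (pml4, pdpt, pd, pt, offset)."""
    offset = gpa & ((1 << PAGE_SHIFT) - 1)
    pt = (gpa >> PAGE_SHIFT) & INDEX_MASK
    pd = (gpa >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
    pdpt = (gpa >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK
    pml4 = (gpa >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK
    return pml4, pdpt, pd, pt, offset


def decode_entry(entry):
    """Return (present, large, pfn) for a raw 64-bit EPT entry.

    Simplification: "present" here means any of R/W/X is set.
    """
    present = (entry & EPT_RWX) != 0
    large = bool(entry & EPT_LARGE)
    pfn = (entry & ADDR_MASK) >> PAGE_SHIFT
    return present, large, pfn


if __name__ == "__main__":
    # GPA and raw entry values taken from the logs above.
    print([hex(i) for i in ept_indices(0x792d4ff8)])
    for raw in (0x0, 0x11f29a907, 0xf000eef3f000e2c3):
        present, large, pfn = decode_entry(raw)
        print(hex(raw), "present" if present else "not present",
              "large" if large else "", hex(pfn))
```

The pfn values this decodes for the raw entries match what the module printed, so the address extraction looks consistent. The part worth double-checking is whether the walker tests the permission bits before following an entry.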