On Fri, Aug 11, 2023, Amaan Cheval wrote:
> > Since it sounds like you can test with a custom kernel, try running with this
> > patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> > stuck. The below expands said tracepoint to capture information about
> > mmu_notifiers and memslots generation. With luck, it will reveal a smoking
> > gun.
>
> Thanks for the patch there. We tried migrating a locked up guest to a host with
> this modified kernel twice (logs below). The guest "fixed itself" post
> migration, so the results may not have captured the "problematic" kind of
> page fault, but here they are.

The traces need to be captured from the host where a vCPU is stuck.

> Complete logs of kvm_page_fault tracepoint events, starting just before the
> migration (with 0 guests before the migration, so the first logs should be of
> the problematic guest) as it resolves the lockup:
>
> 1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
> 2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log
>
> Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:
>
> 1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
> 2. https://transfer.sh/LBFJryOfu7/trace-kvm.log
>
> Note that for migration #2 in both lists above (trace-kvm-pf.log and
> trace-kvm.log), we mistakenly didn't confirm that the guest was locked up
> before migration. It most likely was, but in case trace #2 doesn't present the
> same symptoms, that's why.
>
> At an uneducated glance, it seems like `in_prog = 0x1` at least once for every
> `seq` / kvm_page_fault that seems to be "looping" and staying unresolved -

This is completely expected. The "in_prog" field just means that a vCPU took a
fault while an mmu_notifier event was in progress.

> indicating lock contention, perhaps, in trying to invalidate/read/write the
> same page range?

No, just a collision between the primary MMU invalidating something, e.g. to
move a page or do KSM stuff, and a vCPU accessing the page in question.

> We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
> 6.1.38 have had guests lock up - we don't have hosts on more recent kernels, so
> this isn't proof that it's been fixed since then, nor is migration proof of
> that, IMO).

Note, if my hunch is correct, it's the act of migrating to a different *host*
that resolves the problem, not the fact that the migration is to a different
kernel. E.g. I would expect that migrating to the exact same kernel would still
unstick the vCPU.

What I suspect is happening is that the in-progress count gets left elevated,
e.g. because of a start() without a paired end(), and that causes KVM to refuse
to install mappings for the affected range of guest memory. Or possibly the
problematic host is generating an absolutely massive storm of invalidations and
unintentionally DoSing the guest. Either way, migrating the VM to a new host,
and thus a new KVM instance, essentially resets all of that metadata and allows
KVM to fault in pages and establish mappings.

Actually, one thing you could try to unstick a VM would be to do an intra-host
migration, i.e. migrate it to a new KVM instance on the same host. If that
"fixes" the guest, then the bug is likely an mmu_notifier counting bug and not
an invalidation storm.

But the easiest thing would be to catch a host in the act, i.e. capture traces
with my debug patch from a host with a stuck vCPU.
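
For reference, a rough sketch of the capture I have in mind (the output file
name and the capture window are just suggestions; the important part is that
this runs on the host where the vCPU is currently stuck, with the debug patch
applied, rather than post-migration):

  # On the host with the stuck vCPU, with the debug patch applied:
  trace-cmd record -e kvm:kvm_page_fault -e kvmmmu -o trace-stuck-vcpu.dat
  # ...let it run for a few seconds while the vCPU is spinning, then Ctrl-C...
  trace-cmd report -i trace-stuck-vcpu.dat > trace-stuck-vcpu.log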