> There's a pretty big list, see the "failure" paths of do_numa_page() and
> migrate_misplaced_page().

Gotcha, thank you!

...

> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> stuck. The below expands said tracepoint to capture information about
> mmu_notifiers and memslots generation. With luck, it will reveal a smoking
> gun.

Thanks for the patch there. We tried migrating a locked-up guest to a host with
this modified kernel twice (logs below). The guest "fixed itself" post-migration,
so the results may not have captured the "problematic" kind of page fault, but
here they are.

Complete logs of kvm_page_fault tracepoint events, starting just before the
migration (with 0 guests on the target host beforehand, so the first entries
should be from the problematic guest) as it resolves the lockup:

1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log

Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:

1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
2. https://transfer.sh/LBFJryOfu7/trace-kvm.log

Note that for migration #2 (trace-kvm-pf.log and trace-kvm.log in the respective
lists above), we mistakenly didn't confirm that the guest was locked up before
the migration. It most likely was, but if trace #2 doesn't show the same
symptoms, that's why.

At an uneducated glance, it looks like `in_prog = 0x1` shows up at least once
for every `seq` / kvm_page_fault that appears to be "looping" and staying
unresolved - perhaps indicating contention between invalidating and
reading/writing the same page range? Any leads on where in the source code I
could look to understand how that might happen? (I've put a rough sketch of my
current understanding at the end of this mail.)

----

@Eric

> Does the VM make progress even if is migrated to a kernel that presents the
> bug?

We're unsure which kernel versions do present the bug, so it's hard to say.
We've definitely seen it occur on kernels 5.15.49 through 6.1.38, but beyond
that we don't know for certain. (Potentially as early as 5.10.103, though!)

> What was kernel version being migrated from and to?

The live migration where the issue was resolved by migrating was from 6.1.12 to
6.5.0-rc2. The traces above are for this live migration (source 6.1.x to target
host 6.5.0-rc2). Another migration was from 6.1.x to 6.1.39 (not for these
traces). In all of these cases the guest resumed/made progress post-migration.

> For example, was it from a >5.19 kernel to something earlier than 5.19?

No, we haven't tried migrating to < 5.19 yet - we have very few hosts running
kernels that old.

> For example, if the hung VM remains stuck after migrating to a >5.19 kernel
> but _not_ to a <5.19 kernel, then maybe bisect is an option.

From what Sean and I discussed above, we suspect that the VM remaining stuck is
likely due to the host kernel soft-locking (stalling in the kernel) because of
the original bug.

We do know this issue _occurs_ on 6.1.38 at least (i.e. hosts running 6.1.38
have had guests lock up). We don't have hosts on more recent kernels, so this
isn't proof that it's been fixed since then, nor is the migration result proof
of that, IMO.
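
----

As mentioned above, here is a rough sketch of how I currently understand the
fault-vs-invalidate retry protocol, in case it helps show where my reading might
be off. As far as I can tell the real check lives around
mmu_invalidate_retry() / mmu_notifier_retry() in include/linux/kvm_host.h and
its callers in arch/x86/kvm/mmu/mmu.c, but the code below is a self-contained
user-space toy model with names of my own choosing, not the kernel code:

```c
/*
 * Toy user-space model of my understanding of KVM's fault-vs-invalidate
 * retry protocol. Names loosely follow mmu_invalidate_seq /
 * mmu_invalidate_in_progress; this is NOT the actual kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long invalidate_seq;   /* bumped when an invalidation completes */
static int invalidate_in_progress;     /* nonzero while a range invalidation runs */

/* Invalidation side: mark in-progress, do the invalidate, then bump seq. */
static void begin_invalidate(void) { invalidate_in_progress++; }
static void end_invalidate(void)   { invalidate_seq++; invalidate_in_progress--; }

/* Fault side: snapshot seq before faulting the page in... */
static unsigned long fault_snapshot_seq(void) { return invalidate_seq; }

/* ...and before installing the mapping, check whether we raced. */
static bool fault_must_retry(unsigned long snapshot)
{
	if (invalidate_in_progress)      /* the "in_prog = 0x1" case in the traces? */
		return true;
	if (invalidate_seq != snapshot)  /* an invalidation completed meanwhile */
		return true;
	return false;
}

int main(void)
{
	unsigned long snap = fault_snapshot_seq();

	begin_invalidate();
	/* While an invalidation is pending, every fault in the range retries. */
	printf("retry while in progress: %d\n", fault_must_retry(snap));
	end_invalidate();

	/* Even after it finishes, a stale snapshot still forces one more retry. */
	printf("retry on stale seq:      %d\n", fault_must_retry(snap));

	/* A fresh attempt with an up-to-date snapshot can finally succeed. */
	snap = fault_snapshot_seq();
	printf("retry after refresh:     %d\n", fault_must_retry(snap));
	return 0;
}
```

If that model is roughly right, then a fault that "loops" forever with
`in_prog = 0x1` would mean something keeps the invalidation side in progress (or
keeps bumping the sequence) faster than the fault can complete - which is why
I'm suspecting contention on the same page range, but I may well be
misinterpreting the tracepoint fields.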