On Mon, Jul 24, 2023, Amaan Cheval wrote:
> > > I've also run a `function_graph` trace on some of the affected hosts, if you
> > > think it might be helpful...
> >
> > It wouldn't hurt to see it.
> >
>
> Here you go:
> https://transfer.sh/SfXSCHp5xI/ept-function-graph.log

Yeesh. There is a ridiculous amount of potentially problematic activity. KSM is
active in that trace, it looks like NUMA balancing might be in play, there might
be hugepage shattering, etc.

> > > Another interesting observation we made was that when we migrate a guest to a
> > > different host, the guest _stays_ locked up and throws EPT violations on the new
> > > host as well
> >
> > Ooh, that's *very* interesting. That pretty much rules out memslot and mmu_notifier
> > issues.
>
> Good to know, thanks!

Let me rephrase that statement: it rules out a certain class of memslot and
mmu_notifier bugs, namely bugs where KVM would incorrectly leave an invalidation
refcount (for lack of a better term) elevated. It doesn't mean memslot changes
and/or mmu_notifier events aren't at fault.

Can you migrate a hung guest to a host that is completely unloaded? And ideally,
disable KSM and NUMA autobalancing on the target host. And then get a
function_graph trace on that host, assuming the vCPU remains stuck.

There is *so* much going on in the above graph that it's impossible to determine
if there's a kernel bug, e.g. it's possible the vCPU is stuck purely because it's
being thrashed to the point where it can't make forward progress.

> > To mostly confirm this is likely what's happening, can you enable all of the async
> > #PF tracepoints in KVM? The exact tracepoints might vary depending on which kernel
> > version you're running, just enable everything with "async" in the name, e.g.
> >
> >   # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
> >   kvm_async_pf_completed/
> >   kvm_async_pf_not_present/
> >   kvm_async_pf_ready/
> >   kvm_async_pf_repeated_fault/
> >   kvm_try_async_get_page/
> >
> > If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
> > then this is likely what's happening.
>
> I did this and unfortunately don't see any of these functions being called at all,
> despite EPT_VIOLATIONs still being thrown and pf_taken still climbing. (Tried both
> with `trace-cmd -e ...` and with `bpftrace`, and none of those functions are being
> called during the deadlock / while the guest is stuck.)

Well fudge.

> > And then to really confirm, this small bpf program will yell if get_user_pages_remote()
> > fails when attempting to get a single page (which is always the case for KVM's async
> > #PF usage).
> >
> >   $ tail gup_remote.bt
> >   kretfunc:get_user_pages_remote
> >   {
> >           if ( args->nr_pages == 1 && retval != 1 ) {
> >                   printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
> >           }
> >   }
> >
>
> Our hosts don't have kfunc/kretfunc support (`bpftrace --info` reports `kret: no`),
> but I tried just a kprobe to verify that get_user_pages_remote is being called at
> all - it does not seem like it is, unfortunately:
>
> ```
> # bpftrace -e 'kprobe:get_user_pages_remote { @[comm] = count(); }'
> Attaching 1 probe...
> ^C
> #
> ```
>
> So I guess that disproves the async #PF theory?

Yeah. Definitely not related to async page faults.
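
For the unloaded-host experiment above, a minimal sketch of what I have in mind on
the destination host (these are the standard sysfs/tracefs knobs; the vCPU thread
PID is a placeholder you'd need to fill in, and how long you let the trace run is
up to you):

  # Stop KSM and unmerge anything it has already merged (0 = just stop merging)
  echo 2 > /sys/kernel/mm/ksm/run
  # Turn off automatic NUMA balancing
  echo 0 > /proc/sys/kernel/numa_balancing

  # Grab a function_graph trace limited to the stuck vCPU thread
  cd /sys/kernel/debug/tracing
  echo function_graph > current_tracer
  echo <vcpu thread pid> > set_ftrace_pid
  echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
  cp trace /tmp/ept-function-graph-unloaded.log

With KSM and NUMA balancing out of the picture and nothing else running on the host,
whatever is left in that trace should be a lot easier to reason about.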
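
And purely for completeness, since your hosts lack kfunc/kretfunc support: if
get_user_pages_remote() ever does start firing, the check in gup_remote.bt could be
approximated with a kprobe+kretprobe pair. Untested sketch; it assumes the current
signature where nr_pages is the third parameter (arg2), older kernels that still
take a 'tsk' parameter would need arg3 instead, and the faulting address isn't
available at function return so it's dropped from the printf:

  $ cat gup_remote_kprobe.bt
  kprobe:get_user_pages_remote
  {
          /* stash nr_pages for this thread so the return probe can see it */
          @nr[tid] = arg2;
  }

  kretprobe:get_user_pages_remote
  {
          if ( @nr[tid] == 1 && retval != 1 ) {
                  printf("Failed remote gup(), ret = %d\n", retval);
          }
          delete(@nr[tid]);
  }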