> > I've also run a `function_graph` trace on some of the affected hosts, if you
> > think it might be helpful...
>
> It wouldn't hurt to see it.
>

Here you go: https://transfer.sh/SfXSCHp5xI/ept-function-graph.log

> > Another interesting observation we made was that when we migrate a guest to a
> > different host, the guest _stays_ locked up and throws EPT violations on the new
> > host as well
>
> Ooh, that's *very* interesting. That pretty much rules out memslot and mmu_notifier
> issues.

Good to know, thanks!

> What I suspect is happening is that get_user_pages_remote() fails for some reason,
> i.e. the workqueue doesn't fault in the page, and the vCPU gets stuck trying to
> fault in a page that can't be faulted in for whatever reason. AFAICT, nothing in
> KVM will actually complain or even surface the problem in tracepoints (yeah, that's
> not good).

Thanks for the explanation. I did suspect something similar, seeing how the page
faults / EPT_VIOLATIONs tend to loop on the same eip/rip/instruction and address
(not always, but quite often).

> To mostly confirm this is likely what's happening, can you enable all of the async
> #PF tracepoints in KVM? The exact tracepoints might vary depending on which kernel
> version you're running, just enable everything with "async" in the name, e.g.
>
>   # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
>   kvm_async_pf_completed/
>   kvm_async_pf_not_present/
>   kvm_async_pf_ready/
>   kvm_async_pf_repeated_fault/
>   kvm_try_async_get_page/
>
> If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
> then this is likely what's happening.

I did this and, unfortunately, I don't see any of these functions being called at
all, despite EPT_VIOLATIONs still being thrown and pf_taken still climbing. (I tried
both `trace-cmd -e ...` and `bpftrace`, and none of those functions are called
while the guest is stuck in the deadlock.)
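
For anyone following along, "enable everything with async in the name" can be
sketched as a small shell function. This is only an illustration: the function
name `enable_async_kvm_events` is made up, and the default tracefs path is an
assumption (on some hosts tracefs is mounted at /sys/kernel/tracing instead).

```shell
#!/bin/sh
# Hypothetical helper: enable every KVM trace event whose name contains "async".
# $1 is the tracefs root; the default below is an assumption about the mount point.
enable_async_kvm_events() {
    tracefs="${1:-/sys/kernel/debug/tracing}"
    for ev in "$tracefs"/events/kvm/*async*; do
        # If the glob matched nothing it stays literal, so skip it.
        [ -e "$ev/enable" ] || continue
        echo 1 > "$ev/enable"
        echo "enabled ${ev##*/}"
    done
}
```

Run as root, `enable_async_kvm_events` should print one line per event it
enabled; the events can then be read from `trace_pipe` under the same tracefs
root. If it prints nothing, no event matched on that kernel.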

> And then to really confirm, this small bpf program will yell if get_user_pages_remote()
> fails when attempting to get a single page (which is always the case for KVM's async
> #PF usage).
>
>   $ tail gup_remote.bt
>   kretfunc:get_user_pages_remote
>   {
>           if ( args->nr_pages == 1 && retval != 1 ) {
>                   printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
>           }
>   }
>

Our hosts don't have kfunc/kretfunc support (`bpftrace --info` reports `kret: no`),
so I tried a plain kprobe to verify whether get_user_pages_remote is being called
at all. Unfortunately, it does not seem to be:

```
# bpftrace -e 'kprobe:get_user_pages_remote { @[comm] = count(); }'
Attaching 1 probe...
^C
#
```

So I guess that disproves the async #PF theory? Are there any other potential causes
you can think of, or anything we can try on the faulting hosts that might help
illuminate the issue further?

Thanks for your time and help!