Re: Deadlock due to EPT_VIOLATION

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 21 Jul 2023 10:37:22 -0700

On Fri, Jul 21, 2023, Amaan Cheval wrote:
> I've also run a `function_graph` trace on some of the affected hosts, if you
> think it might be helpful to have a look at that to see what the host kernel
> might be doing while the guests are looping on EPT_VIOLATIONs. Nothing obvious
> stands out to me right now.

It wouldn't hurt to see it.

> We suspected KSM briefly, but ruled that out by turning KSM off and unmerging
> KSM pages - after doing that, a guest VM still locked up / started looping
> EPT_VIOLATIONS (like in Brian's original email), so it's unlikely this is KSM specific.
> 
> Another interesting observation we made was that when we migrate a guest to a
> different host, the guest _stays_ locked up and throws EPT violations on the new
> host as well 

Ooh, that's *very* interesting.  That pretty much rules out memslot and mmu_notifier
issues.

>- so it's unlikely the issue is in the guest kernel itself (since
> we see it across guest operating systems), but perhaps the host kernel is
> messing the state of the guest kernel up in a way that keeps it locked up after
> migrating as well?
> 
> If you have any thoughts on anything else to try, let me know!

Good news and bad news.  Good news: I have a plausible theory as to what might be
going wrong.  Bad news: if my theory is correct, our princess is in another castle
(the bug isn't in KVM).

One of the scenarios where KVM retries page faults is when KVM asynchronously faults in
the host backing page.  If faulting in the page would require I/O, e.g. because
it's been swapped out, instead of synchronously doing the I/O on the vCPU task,
KVM uses a workqueue to fault in the page and immediately resumes the guest.

There are a variety of conditions that must be met to try an async page fault, but
assuming you aren't disabling HLT VM-Exit, i.e. aren't letting the guest execute HLT,
it really just boils down to IRQs being enabled in the guest, which looking at the
traces is pretty much guaranteed to be true.

What's _supposed_ to happen is that async_pf_execute() successfully faults in the
page via get_user_pages_remote(), and then KVM installs a mapping for the guest
either in kvm_arch_async_page_ready() or by resuming the guest and cleanly handling
the retried guest page fault.

What I suspect is happening is that get_user_pages_remote() fails for some reason,
i.e. the workqueue doesn't fault in the page, and the vCPU gets stuck trying to
fault in a page that can't be faulted in for whatever reason.  AFAICT, nothing in
KVM will actually complain or even surface the problem in tracepoints (yeah, that's
not good).

Circling back to the bad news, if that's indeed what's happening, it likely means
there's a bug somewhere else in the stack.  E.g. it could be core mm/, might be
in the block layer, in swap, possibly in the exact filesystem you're using, etc.

Note, there's also a paravirt extension to async #PFs, where instead of putting
the vCPU into a synthetic halted state, KVM instead *may* inject a synthetic #PF
into the guest, e.g. so that the guest can go run a different task while the
faulting task is blocked.  But this really is just a note, guest enabling of PV
async #PF shouldn't actually matter, again assuming my theory is correct.

To mostly confirm this is likely what's happening, can you enable all of the async
#PF tracepoints in KVM?  The exact tracepoints might vary depending on which kernel
version you're running, just enable everything with "async" in the name, e.g.

  # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
  kvm_async_pf_completed/
  kvm_async_pf_not_present/
  kvm_async_pf_ready/
  kvm_async_pf_repeated_fault/
  kvm_try_async_get_page/
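
A minimal sketch for flipping them all on in one go, assuming tracefs is reachable
at the debugfs path shown above.  The function name and the optional directory
argument are hypothetical and exist only to make the snippet easy to try out:

```shell
# Enable every KVM tracepoint whose name contains "async".
# $1 (optional, hypothetical): events directory, defaults to the path used above.
enable_async_tracepoints() {
    eat_events_dir="${1:-/sys/kernel/debug/tracing/events/kvm}"
    for eat_ev in "$eat_events_dir"/*async*/; do
        # Skip the unexpanded glob (no matches) or anything without an enable knob.
        [ -e "${eat_ev}enable" ] || continue
        echo 1 > "${eat_ev}enable"
        echo "enabled $(basename "$eat_ev")"
    done
}
```

Run it as root with no arguments to act on the live tracepoints.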

If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
then this is likely what's happening.
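
One rough way to eyeball that comparison, as a sketch only: count the
kvm_try_async_get_page events in a saved copy of the trace buffer and print the
count next to the stat.  The default stat path (the aggregate pf_taken file under
KVM's debugfs directory) is an assumption about your setup, and the function name
is hypothetical:

```shell
# Print the number of kvm_try_async_get_page events in a trace file next to pf_taken.
# $1: file holding trace output (e.g. a copy of the tracing "trace" file)
# $2 (optional): stat file; the default path is an assumption, adjust as needed.
count_async_pf() {
    cap_trace="$1"
    cap_stat="${2:-/sys/kernel/debug/kvm/pf_taken}"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    cap_async=$(grep -c 'kvm_try_async_get_page' "$cap_trace" || true)
    cap_taken=$(cat "$cap_stat")
    echo "kvm_try_async_get_page=$cap_async pf_taken=$cap_taken"
}
```

pf_taken is cumulative, so take two samples a few seconds apart and compare how
much each number grew rather than comparing the absolute values.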

And then to really confirm, this small bpf program will yell if get_user_pages_remote()
fails when attempting to get a single page (which is always the case for KVM's async
#PF usage).

FWIW, get_user_pages_remote() isn't used all that much, e.g. when running a VM in
my setup, KVM is the only user.  So you can likely aggressively instrument
get_user_pages_remote() via bpf without major problems, or maybe even assume that
any call is from KVM.

$ tail gup_remote.bt 
kretfunc:get_user_pages_remote
{
        if ( args->nr_pages == 1 && retval != 1 ) {
                printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
        }
}
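
If the script does start yelling, a per-return-value tally makes the output easier
to digest.  The sketch below assumes the bpftrace output was saved to a file, e.g.
via "bpftrace gup_remote.bt | tee gup_remote.log"; the log file name and the
function name are hypothetical:

```shell
# Tally "Failed remote gup()" lines from saved bpftrace output by return value.
# $1: log file containing the printf output of the script above.
summarize_gup_failures() {
    awk '/Failed remote gup\(\)/ { count[$NF]++ }   # last field is the ret value
         END { for (r in count) printf "ret=%s count=%d\n", r, count[r] }' "$1" | sort
}
```

A return value that shows up over and over (a negative errno, or 0) is a good
starting point for figuring out which layer is refusing to fault in the page.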