Re: Deadlock due to EPT_VIOLATION

Amaan Cheval <amaan.cheval@xxxxxxxxx> · Mon, 24 Jul 2023 17:38:58 +0530

> > I've also run a `function_graph` trace on some of the affected hosts, if you
> > think it might be helpful...
>
> It wouldn't hurt to see it.
>

Here you go:
https://transfer.sh/SfXSCHp5xI/ept-function-graph.log

> > Another interesting observation we made was that when we migrate a guest to a
> > different host, the guest _stays_ locked up and throws EPT violations on the new
> > host as well
>
> Ooh, that's *very* interesting.  That pretty much rules out memslot and mmu_notifier
> issues.

Good to know, thanks!

> What I suspect is happening is that get_user_pages_remote() fails for some reason,
> i.e. the workqueue doesn't fault in the page, and the vCPU gets stuck trying to
> fault in a page that can't be faulted in for whatever reason.  AFAICT, nothing in
> KVM will actually complain or even surface the problem in tracepoints (yeah, that's
> not good).

Thanks for the explanation. I suspected something similar, given how the page
faults / EPT_VIOLATIONs tend to loop on the same eip/rip/instruction and address
(not always, but quite often).

> To mostly confirm this is likely what's happening, can you enable all of the async
> #PF tracepoints in KVM?  The exact tracepoints might vary depending on which kernel
> version you're running, just enable everything with "async" in the name, e.g.
>
>   # ls -1 /sys/kernel/debug/tracing/events/kvm | grep async
>   kvm_async_pf_completed/
>   kvm_async_pf_not_present/
>   kvm_async_pf_ready/
>   kvm_async_pf_repeated_fault/
>   kvm_try_async_get_page/
>
> If kvm_try_async_get_page() is more or less keeping pace with the "pf_taken" stat,
> then this is likely what's happening.

I did this and, unfortunately, don't see any of these tracepoints firing at all,
even though EPT_VIOLATIONs are still being thrown and pf_taken is still climbing.
(I tried both `trace-cmd -e ...` and `bpftrace`, and neither showed any of them
being hit while the guest was stuck.)
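
For anyone wanting to reproduce this, here is a minimal sketch of enabling every
event with "async" in its name through tracefs. The `enable_async_events` helper
and its optional directory argument are illustrative names, not the exact
commands I ran, and the default path assumes debugfs is mounted at
/sys/kernel/debug:

```shell
#!/bin/sh
# Hypothetical helper: enable every KVM tracepoint whose name contains "async".
# The events directory can be overridden; the default assumes the usual
# debugfs mount point.
enable_async_events() {
    events_dir="${1:-/sys/kernel/debug/tracing/events/kvm}"
    for ev in "$events_dir"/*async*/; do
        # Skip the unexpanded glob (no match) or entries without an enable file.
        [ -f "${ev}enable" ] || continue
        echo 1 > "${ev}enable"
        echo "enabled ${ev}"
    done
}
```

Run as root, calling `enable_async_events` with no argument and then reading
/sys/kernel/debug/tracing/trace_pipe should show the events if any of them fire.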

> And then to really confirm, this small bpf program will yell if get_user_pages_remote()
> fails when attempting to get a single page (which is always the case for KVM's async
> #PF usage).
>
> $ tail gup_remote.bt
> kretfunc:get_user_pages_remote
> {
>         if ( args->nr_pages == 1 && retval != 1 ) {
>                 printf("Failed remote gup() on address %lx, ret = %d\n", args->start, retval);
>         }
> }
>

Our hosts don't have kfunc/kretfunc support (`bpftrace --info` reports
`kret: no`), but I tried a plain kprobe to check whether get_user_pages_remote
is being called at all. Unfortunately, it doesn't seem to be:

```
# bpftrace -e 'kprobe:get_user_pages_remote { @[comm] = count(); }'
Attaching 1 probe...
^C
#
```

So I guess that disproves the async #PF theory? Are there any other potential
causes you can think of, or anything we can try on the faulting hosts that might
help illuminate the issue further?

Thanks for your time and help!