Re: Deadlock due to EPT_VIOLATION

> There's a pretty big list, see the "failure" paths of do_numa_page() and
> migrate_misplaced_page().

Gotcha, thank you!

...

> Since it sounds like you can test with a custom kernel, try running with this
> patch and then enable the kvm_page_fault tracepoint when a vCPU gets
> stuck.  The below expands said tracepoint to capture information about
> mmu_notifiers and memslots generation.  With luck, it will reveal a smoking
> gun.

Thanks for the patch there. We tried migrating a locked up guest to a host with
this modified kernel twice (logs below). The guest "fixed itself" post
migration, so the results may not have captured the "problematic" kind of
page-fault, but here they are.

Complete logs of kvm_page_fault tracepoint events, starting just before the
migration (with 0 guests before the migration, so the first logs should be of
the problematic guest) as it resolves the lockup:

1. https://transfer.sh/QjB3MjeBqh/trace-kvm-kpf2.log
2. https://transfer.sh/wEFQm4hLHs/trace-kvm-pf.log

Truncated logs of `trace-cmd record -e kvm -e kvmmmu` in case context helps:

1. https://transfer.sh/FoFsNoFQCP/trace-kvm2.log
2. https://transfer.sh/LBFJryOfu7/trace-kvm.log

Note that for migration #2 in each list above (trace-kvm-pf.log and
trace-kvm.log), we mistakenly didn't confirm that the guest was locked up before
migrating. It most likely was, but that's why trace #2 may not present the same
symptoms.

From an uneducated glance, it seems like `in_prog = 0x1` appears at least once
for every `seq` / kvm_page_fault that seems to be "looping" and staying
unresolved - perhaps indicating contention while trying to invalidate/read/write
the same page range?

Any leads on where in the source code I could look to understand how that might
happen?

----

@Eric

> Does the VM make progress even if is migrated to a kernel that presents the
> bug?

We're unsure which kernel versions do present the bug, so it's hard to say.
We've definitely seen it occur on kernels 5.15.49 to 6.1.38, but beyond that, we
don't know for certain. (Potentially as early as 5.10.103, though!)

> What was kernel version being migrated from and to?

The live migration where the issue was resolved by migrating was from 6.1.12 to
6.5.0-rc2.

The traces above are for this live migration (source 6.1.x to target host
6.5.0-rc2).

Another migration (not captured in these traces) was from 6.1.x to 6.1.39. In
all of these cases the guest resumed and made progress post-migration.

> For example, was it from a >5.19 kernel to something earlier than 5.19?

No, we haven't tried migrating to < 5.19 yet - we have very few hosts running
kernels that old.

> For example, if the hung VM remains stuck after migrating to a >5.19 kernel
> but _not_ to a <5.19 kernel, then maybe bisect is an option.

From what Sean and I discussed above, we suspect that the VM remaining stuck is
likely due to the host kernel soft-locking while stalled on the original bug.

We do know this issue _occurs_ as late as 6.1.38 at least (i.e. hosts running
6.1.38 have had guests lock up). We don't have hosts on more recent kernels, so
this isn't proof that it's been fixed since then, nor is a successful migration
proof of that, IMO.


