[PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode

Roman Kagan <rkagan@xxxxxxxxxxxxx> · Mon, 12 Dec 2016 17:32:21 +0300

Async pagefault machinery assumes communication with L1 guests only: all
the state -- MSRs, apf area addresses, etc. -- is for L1.  However, it
currently doesn't check whether the vCPU is running L1 or L2, and may
inject a #PF into whatever context is currently executing.

In vmx this just results in crashing the L2 on bogus #PFs and hanging
tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
host with swap enabled, run a VM on it, run a nested VM on top, and set
an RSS limit for L1 on the host via
/sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
to swap it out (you may need to tighten and loosen it once or twice, or
create some memory load inside L1).  Very quickly the L2 guest starts
receiving pagefaults with bogus %cr2 (actually apf tokens from the
host), and the L1 guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait.

In svm such #PFs are converted into a vmexit from L2 to L1 on #PF, which
is then handled by L1 similarly to an ordinary async_pf.  However, this
only works with KVM running in L1; another hypervisor may not expect it
(e.g. VirtualBox asserts on a #PF vmexit when NPT is on).

To avoid that, only do async_pf processing when executing the L1 guest.

Note: this patch only fixes x86; other async_pf-capable arches may also
need something similar.

Signed-off-by: Roman Kagan <rkagan@xxxxxxxxxxxxx>
---
v1 -> v2:
 - more verbose commit log
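
For reviewers, the intended gating can be sketched as a small userspace
model.  This is only an illustration, not kernel code: struct model_vcpu
and the model_* helpers are hypothetical stand-ins for struct kvm_vcpu,
is_guest_mode(), can_do_async_pf() and the completion queue, and the
"stays queued until L1 runs" behaviour is my reading of what skipping
kvm_check_async_pf_completion() amounts to.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm_vcpu; only what the sketch needs. */
struct model_vcpu {
	bool guest_mode;      /* models is_guest_mode(): true while L2 runs */
	bool apf_possible;    /* models the result of can_do_async_pf() */
	int  ready_pending;   /* completed async page faults not yet delivered */
	int  ready_delivered; /* PAGE_READY notifications delivered to L1 */
};

/* Models the condition in try_async_pf() with this patch applied. */
static bool model_may_start_async_pf(const struct model_vcpu *v, bool prefault)
{
	return !prefault && !v->guest_mode && v->apf_possible;
}

/* Models the guarded kvm_check_async_pf_completion() call in vcpu_run(). */
static void model_check_completion(struct model_vcpu *v)
{
	if (v->guest_mode)
		return; /* assumed: completions stay queued until L1 runs */

	v->ready_delivered += v->ready_pending;
	v->ready_pending = 0;
}
```

In this model, a vCPU in guest mode neither starts a new async_pf nor
consumes pending completions; both resume once it is back in L1.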

 arch/x86/kvm/mmu.c | 2 +-
 arch/x86/kvm/x86.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d9c7e98..cdafc61 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3510,7 +3510,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && can_do_async_pf(vcpu)) {
+	if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(gva, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04c5d96..bf11fe4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6864,7 +6864,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 			break;
 		}
 
-		kvm_check_async_pf_completion(vcpu);
+		if (!is_guest_mode(vcpu))
+			kvm_check_async_pf_completion(vcpu);
 
 		if (signal_pending(current)) {
 			r = -EINTR;
-- 
2.9.3
