On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
> 2016-12-12 17:32+0300, Roman Kagan:
> > The async pagefault machinery assumes communication with L1 guests
> > only: all of the state -- MSRs, apf area addresses, etc. -- is for
> > L1.  However, it currently doesn't check whether the vCPU is running
> > L1 or L2, and may inject a #PF into whatever context is currently
> > executing.
> >
> > In VMX this just results in crashing the L2 on bogus #PFs and
> > hanging tasks in L1 due to missing PAGE_READY async_pfs.  To
> > reproduce it, use a host with swap enabled, run a VM on it, run a
> > nested VM on top, and set the RSS limit for L1 on the host via
> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> > to swap it out (you may need to tighten and loosen it once or twice,
> > or create some memory load inside L1).  Very quickly the L2 guest
> > starts receiving page faults with bogus %cr2 (actually apf tokens
> > from the host), and the L1 guest starts accumulating tasks stuck in
> > D state in kvm_async_pf_task_wait.
> >
> > In SVM such #PFs are converted into a #PF vmexit from L2 to L1,
> > which L1 then handles like an ordinary async_pf.  However, this only
> > works with KVM running in L1; another hypervisor may not expect it
> > (e.g. VirtualBox asserts on a #PF vmexit when NPT is on).
>
> async_pf is an optional paravirtual device.  It is L1's fault if it
> enabled something that it doesn't support ...

async_pf in L1 is enabled by the core Linux kernel; the hypervisor
running in L1 may be a third-party one and has no control over it.

> AMD's behavior makes sense and already works, so I'd like to see the
> same on Intel as well.  (I thought that SVM was broken too, sorry for
> my misleading first review.)
>
> > To avoid that, only do async_pf stuff when executing the L1 guest.
>
> The good thing is that we are already killing VMX L1 with async_pf, so
> regressions don't prevent us from making Intel KVM do the same as AMD:
> force a nested VM exit from nested_vmx_check_exception() if the
> injected #PF is an async_pf, and handle the #PF VM exit in L1.

I don't get your point: the wealth of existing hypervisors running in
L1 that don't take #PF vmexits can be kept from hanging or crashing
their guests with a not-so-complex fix in the L0 hypervisor.  Why
should users have to update *both* their L0 and L1 hypervisors instead?

> I remember that you already implemented this and chose not to post it
> -- were there other problems than the asserts in current
> KVM/VirtualBox?

You must have confused me with someone else ;)  I didn't implement
this; moreover, I tend to think that L1 hypervisor cooperation is
unnecessary and the fix can be done in L0 only.

Roman.
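
P.S. To make the two options concrete for anyone skimming the thread:
Radim's Intel-side suggestion would amount to something roughly along
the lines of the sketch below, against nested_vmx_check_exception() in
arch/x86/kvm/vmx.c.  It is untested and only illustrative; in
particular the "async_pf_pending" flag is hypothetical -- some marker
would have to be set on the queued exception when it is an async_pf
token rather than a real fault -- and the exit interruption info /
exit qualification would need to carry the token to L1, which the
sketch glosses over.

    static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
    {
            struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

            /*
             * Reflect the exception to L1 not only when L1 intercepts
             * vector nr, but also when the #PF is really an async_pf
             * notification from L0 ("async_pf_pending" is a
             * hypothetical flag, not an existing field).
             */
            if (!(vmcs12->exception_bitmap & (1u << nr)) &&
                !(nr == PF_VECTOR && vcpu->arch.apf.async_pf_pending))
                    return 0;

            nested_vmx_vmexit(vcpu, to_vmx(vcpu)->exit_reason,
                              vmcs_read32(VM_EXIT_INTR_INFO),
                              vmcs_readl(EXIT_QUALIFICATION));
            return 1;
    }

Even with that in place, an L1 hypervisor still has to recognize a #PF
vmexit whose %cr2 is an async_pf token, which is exactly what non-KVM
hypervisors like VirtualBox don't do today.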
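
The L0-only direction I have in mind is, at its core, about one line:
never start the async_pf dance while the vCPU is in guest (L2) mode, so
that L0 simply swaps the page in synchronously instead of injecting a
token #PF into whatever is currently running.  A rough, untested sketch
against kvm_can_do_async_pf() in arch/x86/kvm/mmu.c (the PAGE_READY
delivery side would need a similar guest-mode check so that completions
queued while L2 runs are not lost):

    bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
    {
            if (unlikely(!lapic_in_kernel(vcpu) ||
                         kvm_event_needs_reinjection(vcpu)))
                    return false;

            /*
             * All async_pf state (MSRs, the apf area) belongs to L1,
             * so never inject the "page not present" #PF while L2 is
             * running; fall back to a synchronous page-in instead.
             */
            if (is_guest_mode(vcpu))
                    return false;

            return kvm_x86_ops->interrupt_allowed(vcpu);
    }

This requires no changes in L1 at all, which is the whole point.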