Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode


On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
> 2016-12-15 09:55+0300, Roman Kagan:
> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
> >> 2016-12-12 17:32+0300, Roman Kagan:
> >> > The async pagefault machinery assumes communication with L1 guests
> >> > only: all the state -- MSRs, apf area addresses, etc. -- is for L1.
> >> > However, it currently doesn't check whether the vCPU is running L1 or
> >> > L2, and may inject a #PF into whatever context is currently executing.
> >> > 
> >> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
> >> > RSS limit for L1 on the host via
> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> >> > to swap it out (you may need to tighten and loosen it once or twice, or
> >> > create some memory load inside L1).  Very quickly the L2 guest starts
> >> > receiving page faults with bogus %cr2 values (actually apf tokens from
> >> > the host), and the L1 guest starts accumulating tasks stuck in D state
> >> > in kvm_async_pf_task_wait.
> >> > 
> >> > In svm such #PFs are converted into a vmexit from L2 to L1 on the #PF,
> >> > which is then handled by L1 similarly to an ordinary async_pf.  However,
> >> > this only works when KVM is running in L1; another hypervisor may not
> >> > expect this (e.g. VirtualBox asserts on a #PF vmexit when NPT is on).
> >> 
> >> async_pf is an optional paravirtual device.  It is L1's fault if it
> >> enabled something that it doesn't support ...
> > 
> > async_pf in L1 is enabled by the core Linux; the hypervisor may be
> > third-party and have no control over it.
> 
> The admin can pass no-kvmapf to Linux when planning to use a hypervisor
> that doesn't support paravirtualized async_pf.  Linux only allows
> in-kernel hypervisors, which do have full control over it.

Imagine you are a hosting provider selling VPSes to your customers.  You
have basically no control over what they run there.  Now, if you are
brave enough to enable nested virtualization, you most certainly don't
want async_pf to create problems for your customers just because they
run a kernel with async_pf support under a hypervisor without it (which
at the moment describes a significant fraction of VPS owners).

> >> AMD's behavior makes sense and already works, therefore I'd like to see
> >> the same on Intel as well.  (I thought that SVM was broken as well,
> >> sorry for my misleading first review.)
> >> 
> >> > To avoid that, only do async_pf stuff when executing L1 guest.
> >> 
> >> The good thing is that we are already killing VMX L1 with async_pf, so
> >> regressions don't prevent us from making Intel KVM do the same as AMD:
> >> force a nested VM exit from nested_vmx_check_exception() if the injected
> >> #PF is async_pf and handle the #PF VM exit in L1.
> > 
> > I'm not getting your point: the wealth of existing hypervisors running
> > in L1 which don't take #PF vmexits can be kept from hanging or crashing
> > their guests with a fairly simple fix in the L0 hypervisor.  Why should
> > users have to update *both* their L0 and L1 hypervisors instead?
> 
> L1 enables paravirtual async_pf to get notified about L0 page faults,
> which would allow L1 to reschedule the blocked process and get better
> performance.  Running a guest is just another process in L1, hence we
> can assume that L1 is interested in being notified.

That's a nice theory, but in practice there is a fair number of deployed
VMs running a kernel that requests async_pf under a hypervisor that
can't live with it.

> If you want a fix without changing L1 hypervisors, then you need to
> regress KVM on SVM.

I don't buy this argument.  I don't see any significant difference from
L0's viewpoint between emulating a #PF vmexit and emulating an external
interrupt vmexit combined with #PF injection into L1.  The latter,
however, will keep L1 getting along just fine with the existing kernels
and hypervisors.
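
To make the comparison concrete, here is roughly the L0-side flow I have
in mind (a sketch only, not a patch hunk; apart from struct kvm_vcpu and
the existing is_guest_mode() helper, every name below is made up for
illustration):

  #include <linux/kvm_host.h>

  /*
   * Sketch, not real KVM code: where does an async_pf event go once it
   * becomes deliverable?  l0_deliver_async_pf() and the behaviour in
   * the comments are hypothetical; is_guest_mode() is the real helper.
   */
  static void l0_deliver_async_pf(struct kvm_vcpu *vcpu, u32 token)
  {
          if (is_guest_mode(vcpu)) {
                  /*
                   * Leave L2 the same way a host interrupt would:
                   * emulate an "external interrupt" vmexit to L1,
                   * remember the token, and only then inject the usual
                   * async_pf #PF (token in %cr2) into L1.
                   */
                  return;
          }

          /* Already running L1: the existing injection path is fine. */
  }

That way L1 only ever sees the async_pf protocol it already speaks, and
no #PF-vmexit handling is required from the L1 hypervisor.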

> This series regresses needlessly, though -- it forces L1 to wait in L2
> until the page for L2 is fetched by L0.

Indeed, it's half-baked.  I also just realized that it incorrectly does
the nested vmexit before the L1 vmentry, while the #PF injection is only
attempted on the next round, which defeats the whole purpose.  I'll
rework the series once I have the time (hopefully before x-mas).
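
For the record, the ordering the rework has to guarantee looks roughly
like this (again a sketch; handle_async_pf_while_nested() is a made-up
name, only is_guest_mode() is real):

  static void handle_async_pf_while_nested(struct kvm_vcpu *vcpu)
  {
          if (!is_guest_mode(vcpu))
                  return;

          /*
           * Both steps must happen in one and the same run-loop pass:
           *  1) emulate the nested vmexit from L2 to L1;
           *  2) queue the async_pf #PF for L1 right away, so that it is
           *     injected on the very next L1 vmentry -- not on the next
           *     round, after L1 has already run in between.
           */
  }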

Thanks,
Roman.