2016-12-19 10:18+0300, Roman Kagan:
> On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
>> 2016-12-15 09:55+0300, Roman Kagan:
>> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> >> 2016-12-12 17:32+0300, Roman Kagan:
>> >> > Async pagefault machinery assumes communication with L1 guests only:
>> >> > all the state -- MSRs, apf area addresses, etc. -- is for L1.
>> >> > However, it currently doesn't check if the vCPU is running L1 or L2,
>> >> > and may inject a #PF into whatever context is currently executing.
>> >> >
>> >> > In VMX this just results in crashing the L2 on bogus #PFs and hanging
>> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it,
>> >> > use a host with swap enabled, run a VM on it, run a nested VM on top,
>> >> > and set the RSS limit for L1 on the host via
>> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> >> > to swap it out (you may need to tighten and loosen it once or twice,
>> >> > or create some memory load inside L1).  Very quickly the L2 guest
>> >> > starts receiving page faults with bogus %cr2 (apf tokens from the
>> >> > host, actually), and the L1 guest starts accumulating tasks stuck in
>> >> > D state in kvm_async_pf_task_wait.
>> >> >
>> >> > In SVM such #PFs are converted into a #PF vmexit from L2 to L1, which
>> >> > is then handled by L1 similarly to an ordinary async_pf.  However,
>> >> > this only works with KVM running in L1; another hypervisor may not
>> >> > expect it (e.g. VirtualBox asserts on a #PF vmexit when NPT is on).
>> >>
>> >> async_pf is an optional paravirtual device.  It is L1's fault if it
>> >> enabled something that it doesn't support ...
>> >
>> > async_pf in L1 is enabled by core Linux; the hypervisor may be
>> > third-party and have no control over it.
>>
>> The admin can pass no-kvmapf to Linux when planning to use a hypervisor
>> that doesn't support paravirtualized async_pf.  Linux allows only
>> in-kernel hypervisors, which do have full control over it.
>
> Imagine you are a hoster providing VPSes to your customers.  You have
> basically no control over what they run there.  Now if you are brave
> enough to enable nested, you most certainly won't want async_pf to
> create problems for your customers only because they have a kernel with
> async_pf support and a hypervisor without it (which at the moment means
> a significant fraction of VPS owners).

In that situation, you have already told your customers to disable
kvm-apf, because it is broken (on VMX).

After updating L0, you announce that kvm-apf can be enabled; depending on
the fix that KVM uses, it is either enabled only for sufficiently new L1s
or even for older ones.  Not a big difference from the VPS provider's
point of view, IMO.

(Hm, and VPS providers could use a toggle to disable kvm-apf on L0,
because it adds overhead in scenarios with CPU overcommit.)

>> >> AMD's behavior makes sense and already works, therefore I'd like to
>> >> see the same on Intel as well.  (I thought that SVM was broken as
>> >> well, sorry for my misleading first review.)
>> >>
>> >> > To avoid that, only do async_pf stuff when executing the L1 guest.
>> >>
>> >> The good thing is that we are already killing VMX L1 with async_pf, so
>> >> regressions don't prevent us from making Intel KVM do the same as AMD:
>> >> force a nested VM exit from nested_vmx_check_exception() if the
>> >> injected #PF is an async_pf, and handle the #PF VM exit in L1.
>> >
>> > I'm not getting your point: the wealth of existing hypervisors running
>> > in L1 which don't take #PF vmexits can be made not to hang or crash
>> > their guests with a not-so-complex fix in the L0 hypervisor.  Why do
>> > the users need to update *both* their L0 and L1 hypervisors instead?
>>
>> L1 enables paravirtual async_pf to get notified about L0 page faults,
>> which would allow L1 to reschedule the blocked process and get better
>> performance.  Running a guest is just another process in L1, hence we
>> can assume that L1 is interested in being notified.
>
> That's a nice theory, but in practice there is a fair number of
> installed VMs with a kernel that requests async_pf and a hypervisor
> that can't live with it.

Yes, and we don't have to care -- they live now, while kvm-apf is broken.
We can fix them in a way that is backward compatible with known
hypervisors, but the solution is worse because of that.

kvm-apf is just for L1 performance, so it should waste as few cycles as
possible, and because users can't depend on working kvm-apf, I'd not
shackle ourselves to past mistakes.

>> If you want a fix without changing L1 hypervisors, then you need to
>> regress KVM on SVM.
>
> I don't buy this argument.  I don't see any significant difference from
> L0's viewpoint between emulating a #PF vmexit and emulating an external
> interrupt vmexit combined with #PF injection into L1.  The latter,
> however, will keep L1 getting along just fine with the existing kernels
> and hypervisors.

Yes, the delivery method is not crucial; I'd accept another delivery
method if L1 on KVM+SVM doesn't regress performance.

The main regression is not forwarding L0 page faults to L1 while nested,
because of this condition:

	if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {

>> This series regresses needlessly, though -- it forces L1 to wait in L2
>> until the page for L2 is fetched by L0.
>
> Indeed, it's half-baked.  I also just realized that it incorrectly does
> the nested vmexit before L1 vmentry, but #PF injection is attempted on
> the next round, which defeats the whole purpose.

I also see separating the nested VM exit from the kvm-apf event delivery
as a regression -- doesn't delivering interrupt vector 14 in the nested
VM exit work without losing backward compatibility?

Thanks.
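P.S.: to make that last question concrete, below is a rough sketch
(against the nested VMX code in arch/x86/kvm/vmx.c) of what delivering
vector 14 in the forced nested VM exit could look like.  The function
name, the way the async_pf token would reach it, and the error-code
handling are assumptions for illustration only, not a tested patch:

	/*
	 * Sketch only: instead of injecting the async_pf token into L2 as a
	 * #PF, synthesize a #PF (vector 14) VM exit to L1, with the token in
	 * the exit qualification, where a real #PF VM exit reports the
	 * faulting linear address.  Error-code delivery is glossed over.
	 */
	static int nested_vmx_async_pf_exit(struct kvm_vcpu *vcpu, u32 token)
	{
		struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
		u32 intr_info = PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
				INTR_INFO_VALID_MASK;

		/*
		 * If L1 does not intercept #PF, we cannot force this exit and
		 * the caller would need a fallback (e.g. no async_pf for L2).
		 */
		if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR)))
			return 0;

		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, intr_info,
				  token);
		return 1;
	}

Whether L1 hypervisors that never expect #PF VM exits can live with this
is exactly the compatibility question above.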