2016-12-15 09:55+0300, Roman Kagan:
> On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> 2016-12-12 17:32+0300, Roman Kagan:
>> > Async pagefault machinery assumes communication with L1 guests
>> > only: all the state -- MSRs, apf area addresses, etc. -- is for L1.
>> > However, it currently doesn't check if the vCPU is running L1 or
>> > L2, and may inject a #PF into whatever context is currently
>> > executing.
>> >
>> > In vmx this just results in crashing the L2 on bogus #PFs and in
>> > hanging tasks in L1 due to missing PAGE_READY async_pfs.  To
>> > reproduce it, use a host with swap enabled, run a VM on it, run a
>> > nested VM on top, and set the RSS limit for L1 on the host via
>> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> > to swap it out (you may need to tighten and loosen it once or
>> > twice, or create some memory load inside L1).  Very quickly the L2
>> > guest starts receiving pagefaults with bogus %cr2 (actually apf
>> > tokens from the host), and the L1 guest starts accumulating tasks
>> > stuck in D state in kvm_async_pf_task_wait.
>> >
>> > In svm such #PFs are converted into a #PF vmexit from L2 to L1,
>> > which L1 then handles like an ordinary async_pf.  However, this
>> > only works with KVM running in L1; another hypervisor may not
>> > expect it (e.g. VirtualBox asserts on a #PF vmexit when NPT is on).
>>
>> async_pf is an optional paravirtual device.  It is L1's fault if it
>> enabled something that it doesn't support ...
>
> async_pf in L1 is enabled by the core Linux; the hypervisor may be
> third-party and have no control over it.

An admin can pass no-kvmapf to Linux when planning to use a hypervisor
that doesn't support paravirtualized async_pf.  Linux allows only
in-kernel hypervisors, which do have full control over it.

>> AMD's behavior makes sense and already works, therefore I'd like to
>> see the same on Intel as well.  (I thought that SVM was broken as
>> well, sorry for my misleading first review.)
>>
>> > To avoid that, only do async_pf stuff when executing L1 guest.
>>
>> The good thing is that we are already killing VMX L1 with async_pf,
>> so regressions don't prevent us from making Intel KVM do the same as
>> AMD: force a nested VM exit from nested_vmx_check_exception() if the
>> injected #PF is async_pf, and handle the #PF VM exit in L1.
>
> I'm not getting your point: the wealth of existing hypervisors running
> in L1 which don't take #PF vmexits can be made not to hang or crash
> their guests with a not so complex fix in the L0 hypervisor.  Why do
> the users need to update *both* their L0 and L1 hypervisors instead?

L1 enables paravirtual async_pf to get notified about L0 page faults,
which allows L1 to reschedule the blocked process and get better
performance.  Running a guest is just another process in L1, hence we
can assume that L1 is interested in being notified.

If you want a fix without changing L1 hypervisors, then you need to
regress KVM on SVM.  This series regresses needlessly, though -- it
forces L1 to wait in L2 until L0 has fetched the page for L2.  Even
no-kvmapf in L1 is better, because L2 currently enters the apf-halt
state, where an event can still trigger a nested VM exit to L1 or
reschedule away from L2 to a task that isn't waiting for the page.

>> I remember that you already implemented this and chose not to post
>> it -- were there other problems than the asserts in current
>> KVM/VirtualBox?
>
> You must have confused me with someone else ;) I didn't implement
> this; moreover, I tend to think that L1 hypervisor cooperation is
> unnecessary and the fix can be done in L0 only.

The feature already requires L1 cooperation, and the extension to
handle L1 as a hypervisor seems natural to me: if L1 benefits from
paravirtual async_pf, then it will likely benefit from it even when
running L2s.

Because the current state is already broken, I think now is a good time
to do the best known solution right away, instead of fixing this fix
later.
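Roughly, I imagine something along the lines of the sketch below in
arch/x86/kvm/vmx.c (completely untested; vcpu->arch.apf.is_async_pf and
vcpu->arch.apf.token are made-up plumbing that the async_pf injection
path would have to fill in -- the rest mirrors the current
nested_vmx_check_exception()):

static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
	/* Hypothetical flag/token, set when injecting an async_pf #PF. */
	bool async_pf = (nr == PF_VECTOR && vcpu->arch.apf.is_async_pf);

	/* An async_pf #PF goes to L1 even if L1 doesn't intercept #PF. */
	if (!(vmcs12->exception_bitmap & (1u << nr)) && !async_pf)
		return 0;

	if (async_pf) {
		/*
		 * Deliver the apf token through the exit qualification,
		 * which L1 reads as the faulting address (%cr2) -- the
		 * same place where SVM's #PF vmexit puts it (exit_info_2).
		 */
		vmcs_write32(VM_EXIT_INTR_ERROR_CODE,
			     vcpu->arch.exception.error_code);
		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
				  PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
				  INTR_INFO_DELIVER_CODE_MASK |
				  INTR_INFO_VALID_MASK,
				  vcpu->arch.apf.token);
		return 1;
	}

	nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
			  vmcs_read32(VM_EXIT_INTR_INFO),
			  vmcs_readl(EXIT_QUALIFICATION));
	return 1;
}

L1's #PF intercept handler would then see a page fault with the apf
token as the faulting address and could run its usual async_pf logic,
the same way the SVM path already works today.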