2016-12-12 17:32+0300, Roman Kagan:
> Async pagefault machinery assumes communication with L1 guests only: all
> the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
> currently doesn't check if the vCPU is running L1 or L2, and may inject
> a #PF into whatever context is currently executing.
>
> In vmx this just results in crashing the L2 on bogus #PFs and hanging
> tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> host with swap enabled, run a VM on it, run a nested VM on top, and set
> RSS limit for L1 on the host via
> /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> to swap it out (you may need to tighten and loosen it once or twice, or
> create some memory load inside L1).  Very quickly L2 guest starts
> receiving pagefaults with bogus %cr2 (apf tokens from the host
> actually), and L1 guest starts accumulating tasks stuck in D state in
> kvm_async_pf_task_wait.
>
> In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
> then handled by L1 similar to ordinary async_pf.  However this only
> works with KVM running in L1; another hypervisor may not expect this
> (e.g. VirtualBox asserts on #PF vmexit when NPT is on).

async_pf is an optional paravirtual device.  It is L1's fault if it
enabled something that it doesn't support ...

AMD's behavior makes sense and already works, therefore I'd like to see
the same on Intel as well.  (I thought that SVM was broken as well,
sorry for my misleading first review.)

> To avoid that, only do async_pf stuff when executing L1 guest.

The good thing is that we are already killing VMX L1 with async_pf, so
regressions don't prevent us from making Intel KVM do the same as AMD:
force a nested VM exit from nested_vmx_check_exception() if the injected
#PF is async_pf and handle the #PF VM exit in L1.

I remember that you already implemented this and chose not to post it --
were there other problems than asserts in current KVM/VirtualBox?

Thanks.
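
For illustration only, the routing suggested in this reply (an async_pf #PF
that arrives while L2 is running forces a nested VM exit to L1 instead of
being injected into L2) can be modelled in a few lines of standalone C.
This is a toy sketch, not KVM code: struct vcpu_model, enum pf_route and
route_page_fault() are hypothetical names invented here, and the handling
of ordinary #PFs is reduced to a single "L1 intercepts #PF" flag, which is
a simplifying assumption.

```c
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names exist in KVM. */
enum pf_route {
    PF_INJECT_CURRENT,  /* deliver the #PF into the context that is running */
    PF_EXIT_TO_L1,      /* force a nested VM exit so that L1 handles the #PF */
};

struct vcpu_model {
    bool in_guest_mode;     /* true while L2 is executing */
    bool l1_intercepts_pf;  /* assumption: L1 asked to intercept all #PFs */
};

/*
 * Decide where a #PF goes.  The async_pf state (MSRs, apf area) belongs
 * to L1, so an async_pf #PF must never be delivered into L2.
 */
static enum pf_route route_page_fault(const struct vcpu_model *v,
                                      bool is_async_pf)
{
    if (!v->in_guest_mode)
        return PF_INJECT_CURRENT;   /* L1 is running: inject as usual */

    if (is_async_pf)
        return PF_EXIT_TO_L1;       /* L2 is running: reflect to L1 */

    /* Ordinary #PF in L2: follow what L1 asked for (simplified). */
    return v->l1_intercepts_pf ? PF_EXIT_TO_L1 : PF_INJECT_CURRENT;
}
```

In this model the async_pf case takes priority over L1's intercept setting,
which mirrors the idea of making Intel behave like AMD; whether a real
implementation would need further checks is left open here.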