On Fri, Nov 26, 2021, Lai Jiangshan wrote: > From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx> > > Both VMX and SVM made mistakes of calling kvm_read_and_reset_apf_flags() > in instrumentable code: > vmx_vcpu_enter_exit() > ASYNC PF induced VMEXIT > save cr2 > leave noinstr section > -- kernel #PF can happen here due to instrumentation > handle_exception_nmi_irqoff > kvm_read_and_reset_apf_flags() > > If kernel #PF happens before handle_exception_nmi_irqoff() after leaving > noinstr section due to instrumentation, kernel #PF handler will consume > the incorrect apf flags and panic. > > The problem might be fixed by moving kvm_read_and_reset_apf_flags() > into vmx_vcpu_enter_exit() to consume apf flags earlier before leaving > noinstr section like the way it handles CR2. > > But linux kernel only resigters ASYNC PF for userspace and guest, so I'd omit the guest part. While technically true, it's not relevant to the change as the instrumentation safety comes purely from the user_mode() check. Mentioning the "guest" side of things gets confusing as the "host" may be an L1 kernel, in which case it is also a guest and may have its own guests. It'd also be helpful for future readers to call out that this is true only since commit 3a7c8fafd1b4 ("x86/kvm: Restrict ASYNC_PF to user space"). Allowing async #PF in kernel mode was presumably why this code was non-instrumentable in the first place. > ASYNC PF can't hit when it is in kernel, so apf flags can be changed to > be consumed only when the #PF is from guest or userspace. ... > @@ -1538,7 +1514,20 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) > state = irqentry_enter(regs); > > instrumentation_begin(); > - handle_page_fault(regs, error_code, address); > + > + /* > + * KVM uses #PF vector to deliver 'page not present' events to guests > + * (asynchronous page fault mechanism). The event happens when a > + * userspace task is trying to access some valid (from guest's point of > + * view) memory which is not currently mapped by the host (e.g. the > + * memory is swapped out). Note, the corresponding "page ready" event > + * which is injected when the memory becomes available, is delivered via > + * an interrupt mechanism and not a #PF exception > + * (see arch/x86/kernel/kvm.c: sysvec_kvm_asyncpf_interrupt()). > + */ > + if (!user_mode(regs) || !kvm_handle_async_user_pf(regs, (u32)address)) > + handle_page_fault(regs, error_code, address); To preserve the panic() if an async #PF is injected for !user_mode, any reason not to do? if (!kvm_handle_async_user_pf(regs, (u32)address)) handle_page_fault(regs, error_code, address); Then __kvm_handle_async_user_pf() can keep its panic() on !user_mode(). And it also saves a conditional when kvm_async_pf_enabled is not true.