Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Thu, 30 Jun 2022 11:24:33 +0300



On Wed, 2022-06-29 at 08:53 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > > honoring L1's exception intercepts when handling an exception that
> > > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > > 
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > > 
> > > > Everything leading up to that is related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests.
> > > 
> > > v2:
> > >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> > >     overhaul.
> > >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@xxxxxxxxxx
> > >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> > > 
> > > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@xxxxxxxxxx
> > > 
> > > Sean Christopherson (21):
> > >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> > >     "exit"
> > >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> > >   KVM: x86: Don't check for code breakpoints when emulating on exception
> > >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> > >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> > >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> > >     DR7.GD=1)
> > >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> > >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> > >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> > >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> > >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> > >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> > >   KVM: x86: Formalize blocking of nested pending exceptions
> > >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> > >   KVM: x86: Hoist nested event checks above event injection logic
> > >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> > >     VM-Exit
> > >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> > >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> > >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> > >     behavior
> > >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> > >   KVM: selftests: Add an x86-only test to verify nested exception
> > >     queueing
> > > 
> > >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> > >  arch/x86/include/asm/kvm_host.h               |  35 +-
> > >  arch/x86/kvm/emulate.c                        |   3 +-
> > >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> > >  arch/x86/kvm/svm/svm.c                        |  18 +-
> > >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> > >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> > >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> > >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> > >  arch/x86/kvm/x86.h                            |  11 +-
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> > >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> > >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> > >  15 files changed, 886 insertions(+), 418 deletions(-)
> > >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> > > 
> > > 
> > > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
> > 
> > Hi Sean and everyone!
> > 
> > 
> > > Before I continue reviewing the patch series, I would like you to check whether
> > > I understand the monitor trap/pending debug exception/event injection
> > > logic on VMX correctly. I was looking at the spec for several hours and I still
> > > have more questions than answers about it.
> > 
> > So let me state what I understand:
> > 
> > 1. Event injection (aka eventinj in SVM terms):
> > 
> >   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
> > 
> > >   If I understand correctly, all event injection types, just like on SVM, simply inject:
> > >   they never create something pending, and never drop the injection if the event is not allowed
> > >   (e.g. if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
> > >   for example if you try to inject type 0 (hardware interrupt) while EFLAGS.IF is 0
> > >   (I haven't checked this).
> > 
> >   All event injections happen right away, don't deliver any payload (like DR6), etc.
> > 
> > >   Injection types 4/5/6 do the same as injection types 0/2/3, but in addition
> > >   types 4/6 do a DPL check in the IDT, and all three can advance RIP by VM_ENTRY_INSTRUCTION_LEN
> > >   before pushing it to the exception stack. This is consistent with the cases where these
> > >   trap-like events are intercepted: the interception happens at the start of the instruction
> > >   despite the exceptions being trap-like.
> > 
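
To make item 1 concrete, here is a minimal sketch of how the value written to
VM_ENTRY_INTR_INFO_FIELD is put together. The enum and helper names are
made up for illustration and are not KVM's; the bit layout (vector in bits 7:0,
type in bits 10:8, deliver-error-code in bit 11, valid in bit 31) is what I
read in the SDM, so please double check it against the spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interruption types, i.e. bits 10:8 of the interruption-information field. */
enum intr_type {
	INTR_TYPE_EXT_INTR       = 0, /* external (hardware) interrupt */
	INTR_TYPE_NMI            = 2, /* non-maskable interrupt */
	INTR_TYPE_HARD_EXCEPTION = 3, /* hardware exception, e.g. #PF, #DF */
	INTR_TYPE_SOFT_INTR      = 4, /* software interrupt (INT n) */
	INTR_TYPE_PRIV_SW_EXC    = 5, /* privileged software exception (INT1/ICEBP) */
	INTR_TYPE_SOFT_EXCEPTION = 6, /* software exception (INT3, INTO) */
	INTR_TYPE_OTHER_EVENT    = 7, /* other event; vector 0 is a pending MTF VM exit */
};

#define INTR_INFO_DELIVER_CODE	(1u << 11)
#define INTR_INFO_VALID		(1u << 31)

/* Hypothetical helper: build the 32-bit interruption-information value. */
static inline uint32_t make_intr_info(enum intr_type type, uint8_t vector,
				      bool has_error_code)
{
	uint32_t info = INTR_INFO_VALID | ((uint32_t)type << 8) | vector;

	if (has_error_code)
		info |= INTR_INFO_DELIVER_CODE;
	return info;
}

/*
 * Hypothetical helper: only the software-originated types (4/5/6) consume
 * VM_ENTRY_INSTRUCTION_LEN to advance RIP before it is pushed on the stack.
 */
static inline bool uses_instruction_len(enum intr_type type)
{
	return type == INTR_TYPE_SOFT_INTR ||
	       type == INTR_TYPE_PRIV_SW_EXC ||
	       type == INTR_TYPE_SOFT_EXCEPTION;
}
```

For example, make_intr_info(INTR_TYPE_HARD_EXCEPTION, 8, true) describes an
injected #DF with an error code, and make_intr_info(INTR_TYPE_OTHER_EVENT, 0,
false) describes the type 7 "injected MTF" mentioned in item 3 below.
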
> > 
> > > 2. #DB is the only trap-like exception that can be pending for one more instruction
> > >    if the MOV SS shadow is on (any other cases?).
> >    (AMD just ignores the whole thing, rightfully)
> > 
> > >    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS VMCS field.
> > >    I understand that the CPU writes it when a VM exit happens at a moment
> > >    when a #DB is already pending but not yet delivered.
> > > 
> > >    That field can also (sadly) be used to "inject" a #DB into the guest if the hypervisor sets it;
> > >    such a #DB will actually update DR6 and such, and might be delayed or lost.
> > 
> > 
> > 3. Facts about MTF:
> > 
> >    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
> >      instruction', and is enabled in primary execution controls.
> > 
> >    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
> >      and that has no connection to the 'feature', although usually this injection will be useful
> >      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
> > 
> > >    * An MTF event can be lost if a higher priority VM exit happens. This is why the SDM talks about
> > >      a 'pending MTF': the MTF VM exit should happen unless something else prevents it and/or
> > >      a higher priority VM exit overrides it.
> > 
> > >    * An MTF event is raised (when the primary execution controls bit is enabled):
> > > 
> > >         - After an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating
> > >           the guest state (that is, the stack was switched, stuff was pushed to the new exception stack,
> > >           and RIP was updated to the handler).
> > >           I am not 100% sure about this, but this seems to be what the SDM implies:
> > 
> >           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> >           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> >           VM entry."
> > 
> > >         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
> > >           then once again the MTF will happen on the first instruction of the exception/interrupt handler.
> > 
> >           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> >           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> >           on the instruction boundary following delivery of the event (or any nested exception)."
> > 
> > >           That means that #DB has higher priority than MTF, but it is not specified whether that applies
> > >           to fault-like #DB, trap-like #DB, or both.
> > > 
> > >         - If an instruction causes an exception, once again the MTF will happen on the first instruction
> > >           of the exception handler.
> > 
> >         - Otherwise after an instruction (or REP iteration) retires.
> > 
> > 
> > > If you have more facts about MTF and related stuff, and/or if I made a mistake in the above, I am all ears!
> 
> Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
> Peter Shier. (Just in case anyone is interested in *truly*
> virtualizing the feature under KVM, rather than just setting a
> VM-execution control bit in vmcs02 and calling it done.)
> 
> https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml

Sadly, I can't access this document either :(

Best regards,
	Maxim Levitsky