Re: RFC: Fixing the broken virtual VMX-preemption timer

Jim Mattson <jmattson@xxxxxxxxxx> · Tue, 18 Dec 2018 10:22:10 -0800

On Tue, Dec 18, 2018 at 10:01 AM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Tue, Dec 18, 2018 at 09:38:27AM -0800, Jim Mattson wrote:
> > On Tue, Dec 18, 2018 at 7:04 AM Sean Christopherson
> > <sean.j.christopherson@xxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 17, 2018 at 03:14:14PM -0800, Jim Mattson wrote:
> > > > The virtual VMX-preemption timer doesn't behave correctly when the
> > > > VMCS12 VMX-preemption timer value field is 0 and there is an injected
> > > > event in the VMCS12. The event should be vectored through the guest
> > > > IDT before the "VMX-preemption timer expired" VM-exit from L2 to L1 is
> > > > synthesized by L0, but it is not. Similarly, the virtual
> > > > VMX-preemption timer doesn't behave correctly when the VMCS12
> > > > VMX-preemption timer value field is 0 and there are pending debug
> > > > exceptions in the VMCS12. The pending debug exceptions should be
> > > > delivered before the "VMX-preemption timer expired" VM-exit from L2 to
> > > > L1 is synthesized by L0, but they are not.
> > > >
> > > > The easiest way to fix this is to use the VMX-preemption timer in
> > > > VMCS02 whenever the VMCS12 VMX-preemption timer value field is 0.
> > > > Multiplexing with the existing usage of the VMCS02 VMX-preemption
> > > > timer is straightforward. However, this approach introduces a
> > > > dependency on the underlying hardware having VMX-preemption timer
> > > > support. (Even broken VMX-preemption timer support should be
> > > > sufficient. I know of no VMX-preemption timer errata that would impact
> > > > the case where the VMX-preemption timer value field is 0.)
> > > > Unfortunately, commit f4124500c2c13 ("KVM: nVMX: Fully emulate
> > > > preemption timer") removed the dependency of the virtual
> > > > VMX-preemption timer on a hardware VMX-preemption timer.
> > > >
> > > > I see at least the following three options:
> > > > 1) Require a hardware VMX-preemption timer before advertising a
> > > > virtual VMX-preemption timer.
> > > > 2) Only provide a working virtual VMX-preemption timer when there is a
> > > > hardware VMX-preemption timer, but continue to advertise the broken
> > > > VMX-preemption timer on platforms that don't support a hardware
> > > > VMX-preemption timer.
> > > > 3) Teach KVM how to do guest IDT-vectoring in software, so that a
> > > > hardware VMX-preemption timer isn't necessary.
> > > >
> > > > Thoughts? Other options?
> > >
> > > 4) Move the exception handling out of vmx_check_nested_events() and into
> > >    a separate function, and reorder the flow of inject_pending_event()
> > >    to prioritize VOE.  kvm_vcpu_running() also uses .check_nested_events(),
> > >    not sure what needs to be done there.
> >
> > Unless I'm missing something, this reorganization seems orthogonal to
> > (1) or (2). That is, even if we fix the code that was causing us to
> > bypass the launch of vmcs02, how do we get a VM-exit after the event
> > injection if we don't set up a zero-valued VMX-preemption timer in
> > vmcs02?
>
> inject_pending_event() should do VOE injection AND return -EBUSY to request
> an immediate exit, e.g. vmx_check_nested_events() should take into account
> the fact that we just injected a VOE, i.e. set block_nested_events.
> request_immediate_exit() will use the preemption timer when possible, so
> it should "just work".
>
> Hardware without a preemption timer should also work.  Even though commit
> d264ee0c2ed2 ("KVM: VMX: use preemption timer to force immediate VMExit")
> correctly states that using a self-IPI to request an immediate exit is
> wrong, it's only really wrong in theory.  In practice the IPI will arrive
> as soon as the VOE is vectored in the guest (the unit test was failing
> because there was also a bug in KVM's nested INTR handling).

Makes sense. (But what's VOE?)
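
The flow described in the quoted text above (inject the event, return -EBUSY, and let the immediate-exit request pick its mechanism) can be sketched roughly as below. This is a toy model under stated assumptions, not KVM code: struct toy_vcpu, the toy_* functions, and enum exit_method are all hypothetical names invented for illustration, and only the -EBUSY convention and the "preemption timer when possible, otherwise self-IPI" choice come from the thread.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a KVM symbol. */
enum exit_method { EXIT_NONE, EXIT_PREEMPTION_TIMER, EXIT_SELF_IPI };

struct toy_vcpu {
	bool has_hw_preemption_timer;	/* host CPU offers the timer */
	bool event_pending;		/* an event is waiting to be injected */
	bool event_injected;		/* event queued for the next VM-entry */
	enum exit_method exit_method;	/* how the immediate exit is forced */
};

/*
 * Queue the pending event for injection and return -EBUSY so the caller
 * knows an immediate exit must follow the entry. Returns 0 when there is
 * nothing to inject.
 */
static int toy_inject_pending_event(struct toy_vcpu *v)
{
	if (!v->event_pending)
		return 0;

	v->event_pending = false;
	v->event_injected = true;
	return -EBUSY;
}

/* Prefer a zero-valued preemption timer; fall back to a self-IPI. */
static void toy_request_immediate_exit(struct toy_vcpu *v)
{
	v->exit_method = v->has_hw_preemption_timer ?
			 EXIT_PREEMPTION_TIMER : EXIT_SELF_IPI;
}

/* One pass of the pre-entry logic; returns the chosen exit mechanism. */
static enum exit_method toy_prepare_entry(struct toy_vcpu *v)
{
	v->exit_method = EXIT_NONE;
	if (toy_inject_pending_event(v) == -EBUSY)
		toy_request_immediate_exit(v);
	return v->exit_method;
}
```

In this model a vCPU with a pending event and a hardware timer ends up with EXIT_PREEMPTION_TIMER, the same vCPU without a hardware timer ends up with EXIT_SELF_IPI, and a vCPU with nothing pending requests no exit at all. The model makes no attempt to capture the ordering caveat about self-IPIs raised in the thread.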