Re: [GIT PULL] KVM: x86: SVM changes for 6.4

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 26 Apr 2023 22:48:27 +0200

On Wed, Apr 26, 2023 at 10:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > This is probably the sub-PR for which I'm more interested in giving
> > the code a closer look, but this is more about understanding the
> > changes than it is about expecting something bad in it.
>
> 100% agree.  If you were to scrutinize only one thing for 6.4, the vNMI changes
> are definitely my choice for extra eyeballs.

Interesting read. The split commits left me wondering _why_ patches
1-7 were needed for vNMI, but that's a known limitation of losing the
cover letter, and the Link or Message-Id trailers try to amend for
that.

I have a few comments indeed, most of which are absolutely nits and
can be ignored or fixed as follow-ups. It's my turn to send a "belated
review" patch series, which I'll do for -rc2, but please check if
there are any disagreements.

First of all, this comment caught my attention:

+    /*
+     * Rules for synchronizing int_ctl bits from vmcb02 to vmcb01:
+     *
+     * V_IRQ, V_IRQ_VECTOR, V_INTR_PRIO_MASK, V_IGN_TPR:  If L1 doesn't
+     * intercept interrupts, then KVM will use vmcb02's V_IRQ (and related
+     * flags) to detect interrupt windows for L1 IRQs (even if L1 uses
+     * virtual interrupt masking).  Raise KVM_REQ_EVENT to ensure that
+     * KVM re-requests an interrupt window if necessary, which implicitly
+     * copies this bits from vmcb02 to vmcb01.
+     *
+     * V_TPR: If L1 doesn't use virtual interrupt masking, then L1's vTPR
+     * is stored in vmcb02, but its value doesn't need to be copied from/to
+     * vmcb01 because it is copied from/to the virtual APIC's TPR register
+     * on each VM entry/exit.
+     *
+     * V_GIF: If nested vGIF is not used, KVM uses vmcb02's V_GIF for L1's
+     * V_GIF.  However, GIF is architecturally clear on each VM exit, thus
+     * there is no need to copy V_GIF from vmcb02 to vmcb01.
+     */

"Rules for synchronizing int_ctl bits from vmcb02 to vmcb01" suggests
that this is work done here, and it also misled me into looking at
nested_sync_control_from_vmcb02 (which is instead about vmcb02 ->
vmcb12).

So what about replacing it with

    * int_ctl bits from vmcb02 have to be synchronized to both vmcb12
and vmcb01.
    * The former is in nested_sync_control_from_vmcb02, invoked on every vmexit,
    * while the latter is scattered all over the place:

and perhaps also call out nested_svm_virtualize_tpr(), sync_lapic_to_cr8() and
sync_cr8_to_lapic in the V_TPR part?

Another super small thing which is not worth a respin (would have been):

 Subject: [PATCH 05/13] KVM: x86: Raise an event request when processing NMIs
- if an NMI is pending
+ iff an NMI is pending

The "if" suggests that we were missing an event request, while "iff"
suggests that we were doing them unnecessarily.

As an aside, I like the "coding style violation" of commit
400fee8c9b2d. Because the "limit = 2" case is anti-architectural, it
makes more sense to have it as an "else" rather than as the default.
An alternative could have been:

  unsigned limit = 1;
  if (!static_call(kvm_x86_get_nmi_mask)(vcpu) && !vcpu->arch.nmi_injected)
      limit = 2;

but the ugly condition makes this solution worse.

Next on, commit ab2ee212a57b ("KVM: x86: Save/restore all NMIs when
multiple NMIs are pending"). Here, "allows userspace to restore 255
pending NMIs" in the commit message is kinda scary, and I thought
about following up with a fixlet to KVM_SET_VCPU_EVENTS:

+        events->nmi.pending = min(vcpu->arch.nmi_pending, 2);
         vcpu->arch.nmi_pending = 0;
         atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
         kvm_make_request(KVM_REQ_NMI, vcpu);

but really this isn't needed because process_nmi() does have

     vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit);

So in the end this is also fine, just a remark on the commit message.
May be worth an additional comment instead here in
KVM_SET_VCPU_EVENTS, before the atomic_set().

On to the actual vNMI patch:

+    /*
+     * KVM should never request an NMI window when vNMI is enabled, as KVM
+     * allows at most one to-be-injected NMI and one pending NMI, i.e. if
+     * two NMIs arrive simultaneously, KVM will inject one and set
+     * V_NMI_PENDING for the other.  WARN, but continue with the standard
+     * single-step approach to try and salvage the pending NMI.
+     */
+    WARN_ON_ONCE(is_vnmi_enabled(svm));

Understandable, but also scary. :) I am not sure losing a pending NMI
is a big deal. IIRC the "limit = 2" case only matters because Windows
uses an NMI shootdown when rebooting the system and in some cases it
would hang; but in this case we're in a buggy situation. And it means
having to think about how the IRET+single-step method interacts with
vNMI, and what is the meaning of !svm->awaiting_iret_completion
(tested right below) in this buggy case. I'd just "return" here.

And another small nit to conclude - kvm_get_nr_pending_nmis() could be static.

The only thing that leaves me a bit puzzled is the naming and
rationale of get_vnmi_vmcb_l1(). I'll let you or Santosh clarify that.

Thanks,

Paolo