Re: [PATCH v2 12/25] KVM: VMX: Handle FRED event data

Sean Christopherson <seanjc@xxxxxxxxxx> · Thu, 13 Jun 2024 10:57:52 -0700

On Thu, Jun 13, 2024, Chao Gao wrote:
> On Wed, Jun 12, 2024 at 04:31:35PM -0700, Sean Christopherson wrote:
> >On Sat, May 11, 2024, Chao Gao wrote:
> >> On Fri, May 10, 2024 at 05:36:03PM +0800, Li, Xin3 wrote:
> >> >> >+               if (kvm_is_fred_enabled(vcpu)) {
> >> >> >+                       u64 event_data = 0;
> >> >> >+
> >> >> >+                       if (is_debug(intr_info))
> >> >> >+                               /*
> >> >> >+                                * Compared to DR6, FRED #DB event data saved on
> >> >> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
> >> >> >+                                * inverted, i.e.,
> >> >> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
> >> >> >+                                */
> >> >> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
> >> >> >+                       else if (is_page_fault(intr_info))
> >> >> >+                               event_data = vcpu->arch.cr2;
> >> >> >+                       else if (is_nm_fault(intr_info))
> >> >> >+                               event_data =
> >> >> >+ to_vmx(vcpu)->fred_xfd_event_data;
> >> >> >+
> >> >> 
> >> >> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
> >> >> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
> >> >> and event_data field to B (!=A) when injecting #PF.
> >> >
> >> >VMM should guarantee a FRED guest _sees_ consistent values in CR6/DR6
> >> >and event data. If not it's just a VMM bug that we need to fix.
> >> 
> >> I don't get why VMM should.
> >> 
> >> I know the hardware will guarantee this. And likely KVM will also do this.
> >> but I don't think it is necessary for KVM to assume L1 VMM will guarantee
> >> this. because as long as L2 guest is enlightened to read event_data from stack
> >> only, the ABI between L1 VMM and L2 guest can be: CR2/DR6 may be out of sync
> >> with the event_data. I am not saying it is good that L1 VMM deviates from the
> >> real hardware behavior. But how L1 VMM defines this ABI with L2 has nothing to
> >> do with KVM as L0. KVM shouldn't make assumptions on that.
> >
> >Right, but in that case the propagation of event_data would be from vmcs12 =>
> >vmcs02, which is handled by prepare_vmcs02_early().
> 
> Yes. But delivering this event to L2 may cause VM-exit. So, L0 KVM may need to
> re-inject this event ...
> 
> >
> >For this flow, it specifically handles exception injection from _L0 KVM_, in which
> >case KVM should always follow the architectural behavior.
> 
> ... and go through this exception injection flow. For such an event, there is no
> guarantee that the associated event data is consistent with the vCPU's
> DR6/CR2/XFD_ERR.
> 
> >
> >Ahh, but the code in with __vmx_complete_interrupts() is wrong.  Overwriting
> >vcpu->arch.{dr6,cr2} is wrong, because theres no telling what was in vmcs02.
> >And even if vmcs02 holds DR6/CR2 values, those might be L2 values, i.e. shouldn't
> >clobber the vCPU state.
> 
> Exactly.
> 
> >
> >It's not clear to me that we need to do anything new for FRED in
> >__vmx_complete_interrupts().  The relevant VMCS fields should already hold the
> >correct values, there's no reason to clobber vCPU state.  The reason KVM grabs
> 
> The whole point is to cache the ORIGINAL_EVENT_DATA VMCS field so that KVM can
> set it back to the INJECTED_EVENT_DATA VMCS field when reinjecting the pending
> event in IDT-vectoring information.

Hrm, right.  I was thinking INJECTED_EVENT_DATA would already hold the correct
data, but that's only true when the VM-Exit occurred on an injected event, i.e.
when KVM already set the relevant fields.  Ah, and not capturing the state would
lead to loss of data on migration.

I think the right way to handle this is to add kvm_queued_exception.event_data,
and then fill event_data during kvm_deliver_exception_payload() when injecting
an event for the first time, and set it directly when re-injecting an event.  The
event data is effectively the same thing as the payload, it just happens to be
deliver on the event stack frame, not via architectural register state.

And I think we should also rework kvm_requeue_exception() to open code stuffing
vcpu->arch.exception instead of using kvm_multiple_exception().  The two flows
(injection vs. re-injection) don't actually have that much in common, and the
common parts are the super duper simple things, e.g. actually setting values and
requested KVM_REQ_EVENT.

Aha!  And there is a pre-existing bug in handle_nm_fault_irqoff(), as it clobbers
guest XFD_ERR if CR0.TS=1.

Speaking of XFD_ERR, I think the best way to deal with that is to pass it along
as a payload, but then simply do nothing when delivering the payload... until
FRED comes along, and then kvm_deliver_exception_payload() can be responsible
for setting FRED's ex->event_data.

With that combination of tweaks, "normal" injection always sets event_data via
the payload, and because re-injected events never deliver payloads (already
delivered), KVM will naturally avoid clobbering ex->event_data with stale state
(and obviously doesn't need to overwrite CR2 or DR6).

Attached patches are compile-tested only.  They're a subset of the overall series,
hopefully it's fairly easy to understand where they slot in.