On Wed, Oct 13, 2021, at 1:42 AM, Paolo Bonzini wrote:
> On 13/10/21 09:46, Liu, Jing2 wrote:
>>
>>> On 13/10/21 08:15, Liu, Jing2 wrote:
>>>> After KVM passes XFD through to the guest, when a vmexit opens the
>>>> IRQ window and KVM is interrupted, the kernel softirq path can call
>>>> kernel_fpu_begin() to touch xsave state. This function does XSAVES.
>>>> If guest XFD[18] is 1 and guest AMX state is in registers, then the
>>>> guest AMX state is lost by XSAVES.
>>>
>>> Yes, the host value of XFD (which is zero) has to be restored after
>>> vmexit. See how KVM already handles SPEC_CTRL.
>>
>> I'm trying to understand why QEMU's XFD is zero after the kernel
>> supports AMX.
>
> There are three copies of XFD:
>
> - the guest value stored in vcpu->arch.
>
> - the "QEMU" value attached to host_fpu. This one only becomes zero
>   if QEMU requires AMX (which shouldn't happen).
>
> - the internal KVM value attached to guest_fpu. When #NM happens,
>   this one becomes zero.
>
> The CPU value is:
>
> - the host_fpu value before kvm_load_guest_fpu and after
>   kvm_put_guest_fpu. This ensures that QEMU context switch is as
>   cheap as possible.
>
> - the guest_fpu value between kvm_load_guest_fpu and
>   kvm_put_guest_fpu. This ensures that no state is lost in the case
>   you are describing.
>
> - the OR of the guest value and the guest_fpu value while the guest
>   runs (using either MSR load/save lists, or manual wrmsr like
>   pt_guest_enter/pt_guest_exit). This ensures that the host has the
>   opportunity to get a #NM exception, and allocate AMX state in the
>   guest_fpu and in current->thread.fpu.
>
>> Yes, passthrough is done in two cases: one is when a guest #NM is
>> trapped; the other is when the guest clears XFD before it generates
>> #NM (this is possible for the guest), and then we pass through.
>> For the two cases, we pass through and allocate buffers for
>> guest_fpu and current->thread.fpu.
>
> I think it's simpler to always wait for #NM; it will only happen once
> per vCPU.
> In other words, even if the guest clears XFD before it generates #NM,
> the guest_fpu's XFD remains nonzero and an #NM vmexit is possible.
> After #NM the guest_fpu's XFD is zero; then passthrough can happen
> and the #NM vmexit trap can be disabled.

This will stop being at all optimal when Intel inevitably adds another
feature that uses XFD. In the potentially infinite window between when
the guest starts managing XFD and #NM on behalf of its userspace and
when the guest allocates the other hypothetical feature, all the #NMs
will have to be trapped by KVM. Is it really worthwhile for KVM to use
XFD at all, instead of preallocating the state and being done with it?
KVM would still have to avoid data loss if the guest sets XFD with
non-init state, but #NM could always pass through.
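For concreteness, here is a minimal C sketch of the scheme discussed
above: the three XFD copies, which one the CPU holds at each point, and
why the #NM intercept can only be dropped once guest_fpu's XFD is fully
zero. All names here are illustrative (hence the _sketch suffixes), not
actual KVM code; XFD_HYPO is a made-up second XFD-controlled feature,
and host_fpu's XFD is taken as nonzero per the "shouldn't happen" case.

```c
/* Illustrative sketch, not actual KVM code: models the three XFD
 * copies and the value the CPU's XFD MSR holds at each point. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XFD_AMX  (1ULL << 18)
#define XFD_HYPO (1ULL << 19)    /* hypothetical future XFD feature */

struct fpu_state {
    uint64_t xfd;
};

struct vcpu_sketch {
    uint64_t guest_xfd;          /* guest value stored in vcpu->arch */
    struct fpu_state *host_fpu;  /* "QEMU" value; normally nonzero */
    struct fpu_state *guest_fpu; /* internal KVM value; zeroed on #NM */
};

static uint64_t cpu_xfd;         /* models the XFD MSR */

static void kvm_load_guest_fpu_sketch(struct vcpu_sketch *v)
{
    /* Between load and put the CPU holds the guest_fpu value, so a
     * softirq's XSAVES cannot lose guest AMX state. */
    cpu_xfd = v->guest_fpu->xfd;
}

static void vmenter_sketch(struct vcpu_sketch *v)
{
    /* While the guest runs: OR of guest value and guest_fpu value, so
     * the host still sees #NM even if the guest cleared its own XFD. */
    cpu_xfd = v->guest_xfd | v->guest_fpu->xfd;
}

static void vmexit_sketch(struct vcpu_sketch *v)
{
    cpu_xfd = v->guest_fpu->xfd;
}

static void handle_nm_sketch(struct vcpu_sketch *v)
{
    /* #NM: allocate AMX state for guest_fpu and current->thread.fpu
     * (allocation elided); the guest_fpu XFD then becomes zero. */
    v->guest_fpu->xfd = 0;
}

static void kvm_put_guest_fpu_sketch(struct vcpu_sketch *v)
{
    /* Back to the host_fpu value: QEMU context switch stays cheap. */
    cpu_xfd = v->host_fpu->xfd;
}

static bool can_disable_nm_trap(const struct vcpu_sketch *v)
{
    /* #NM passthrough is only safe once no XFD bit is left that must
     * trap to trigger allocation. */
    return v->guest_fpu->xfd == 0;
}
```

Note how can_disable_nm_trap captures the concern above: if a second
feature bit (XFD_HYPO) existed, guest_fpu's XFD would stay nonzero, and
the #NM intercept would have to stay armed, until every XFD feature the
guest uses has faulted once.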