On Wed, 2023-07-05 at 14:19 +0200, Peter Zijlstra wrote:
> On Wed, Jul 05, 2023 at 11:34:53AM +0000, Huang, Kai wrote:
> 
> > Yeah I think from long-term's view, since SEAMCALLs to support live migration
> > pretty much uses all RCX/RDX/R8-R15 as input/output, it seems reasonable to
> > unify all of them, although I guess there might be some special handling to
> > VP.VMCALL and/or VP.ENTER, e.g., below:
> > 
> >	/* TDVMCALL leaf return code is in R10 */
> >	movq %r10, %rax
> > 
> > So long-termly, I don't have objection to that.  But my thinking is for the
> > first version of TDX host support, we don't have to support all SEAMCALLs but
> > only those involved in basic TDX support.
> 
> Since those calls are out now, we should look at them now, there is no
> point in delaying the pain. That then gives us two options:
> 
>  - we accept them and their wonky calling convention and our code should
>    be ready for it.
> 
>  - we reject them and send the TDX team a message to please try again
>    but with a saner calling convention.
> 
> Sticking our head in the sand and pretending like they don't exist isn't
> really a viable option at this point.

OK.  I'll work on this.  But I think even if we want to unify
__tdx_module_call() and __tdx_hypercall(), the first step should be making
__tdx_module_call() look like __tdx_hypercall()?

I mean, from the point of view of organizing the patchset, we cannot just do it
in one big patch but need to split it into small patches, each doing one thing.
My thinking is perhaps we can organize it this way:

1) Patch(es) to make the TDX_MODULE_CALL macro / __tdx_module_call() look like
   __tdx_hypercall().
2) Add SEAMCALL support based on TDX_MODULE_CALL, e.g., implement __seamcall().
3) Unify __tdx_module_call()/__seamcall() with __tdx_hypercall().

Does this look good?

Btw, I've already done part 1) based on your code, and sent the patches to
Kirill for review.  Should I send them out first?

> 
> > Also, the new SEAMCALLs to handle live migration all seem to have below
> > statement:
> > 
> >	AVX, AVX2 and AVX512 state	May be reset to the architectural INIT state
> > 
> > Which means those SEAMCALLs need to preserve AVX* states too?
> 
> Yes, we need to ensure the userspace 'FPU' state is saved before
> we call them. But I _think_ that KVM already does much of that.

Let me look into this.

> 
> > And reading the spec, the VP.VMCALL and VP.ENTER also can use XMM0 - XMM15 as
> > input/output.  Linux VP.VMCALL seems doesn't support using XMM0 - XMM15 as
> > input/output, but KVM can run other guest OSes too so I think KVM VP.ENTER needs
> > to handle XMM0-XMM15 as input/output too.
> 
> Why would KVM accept VMCALLs it doesn't know about? Just trash the
> guest and call it a day.
> 
> > That being said, I think although we can provide a common asm macro to cover
> > VP.ENTER, I suspect KVM still needs to do additional assembly around the macro
> > too.  So I am not sure whether we should try to cover VP.ENTER.
> 
> Not sure about asm, we have interfaces to save the XMM/AVX regs.
> kernel_fpu_begin() comes to mind, but I know there's more of that,
> including some for KVM specifically.

Yeah, it doesn't have to be asm if it can be done in C.
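E.g., something like below for the SEAMCALLs that may reset AVX/AVX2/AVX512
state to INIT.  This is only a rough sketch to show the idea:
seamcall_fpu_protected() is a made-up name, and the __seamcall() prototype /
argument structure here is just a placeholder until 1)-3) above are settled:

#include <linux/types.h>
#include <asm/fpu/api.h>	/* kernel_fpu_begin()/kernel_fpu_end() */

/* Placeholders only, not what is in the tree today: */
struct tdx_module_args;
u64 __seamcall(u64 fn, struct tdx_module_args *args);

/*
 * Sketch: save/restore the FPU state around a SEAMCALL which may reset
 * AVX/AVX2/AVX512 to the architectural INIT state, so the asm stub itself
 * doesn't need to touch the XMM/AVX registers.
 */
static u64 seamcall_fpu_protected(u64 fn, struct tdx_module_args *args)
{
	u64 ret;

	/* Saves the task's FPU registers and disables preemption */
	kernel_fpu_begin();
	ret = __seamcall(fn, args);
	kernel_fpu_end();

	return ret;
}

For VP.ENTER specifically this would presumably need to hook into KVM's
existing guest-FPU handling instead, as you said KVM already does much of
that.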
> 
> > > I don't think they should be special, they're really just yet another
> > > leaf call. Yes, they have a shit calling convention, and yes VP.ENTER is
> > > terminally broken for unconditionally clobbering BP :-(
> > > 
> > > That really *must* be fixed.
> > 
> > Sure I don't have objection to this, and for VP.ENTER please see above.
> > 
> > But I'd like to say that, generally speaking, from virtualization's point of
> > view, guest has its own BP and conceptually the hypervisor needs to restore
> > guest's BP before jumping to the guest.  E.g., for normal VMX guest, KVM always
> > restores guest's BP before VMENTER (arch/x86/kvm/vmx/vmenter.S):
> > 
> > SYM_FUNC_START(__vmx_vcpu_run)
> >	push %_ASM_BP
> >	mov  %_ASM_SP, %_ASM_BP
> > 
> >	...
> >	mov VCPU_RBP(%_ASM_AX), %_ASM_BP
> >	...
> >	vmenter/vmresume
> >	...
> > SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
> >	...
> >	mov %_ASM_BP, VCPU_RBP(%_ASM_AX)
> >	...
> >	pop %_ASM_BP
> >	RET
> 
> That's disgusting :/ So what happens if we get an NMI after VMENTER and
> before POP? Then it sees a garbage BP value.

Looks like it.

> Why is all this stuff such utter crap?

The problem is KVM has to save/restore BP for the guest, because the VMX
hardware doesn't save/restore BP during VMENTER/VMEXIT.  I am not sure whether
there's a better way to handle this.

My brain is getting slow right now as it's 1 hour past midnight already.  I am
hoping Paolo/Sean can jump in here. :)
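Btw, just to illustrate what 1)-3) above could end up looking like, below is a
rough sketch of the unified calling convention: one structure carrying all the
GPRs the TDX module may use as input/output (RCX/RDX/R8-R15), shared by the
guest-side and host-side stubs.  All names here are tentative, not what is in
the tree today:

#include <linux/types.h>

/*
 * Tentative sketch: a single argument structure covering every GPR the TDX
 * module may use for input/output, so the TDCALL and SEAMCALL stubs can
 * share the same register save/restore asm.
 */
struct tdx_module_args {
	u64 rcx;
	u64 rdx;
	u64 r8;
	u64 r9;
	u64 r10;
	u64 r11;
	u64 r12;
	u64 r13;
	u64 r14;
	u64 r15;
};

u64 __tdcall(u64 fn, struct tdx_module_args *args);	/* guest -> TDX module */
u64 __seamcall(u64 fn, struct tdx_module_args *args);	/* host/VMM -> TDX module */

The special cases (e.g., TDVMCALL's leaf return code being in R10, and
whatever extra handling VP.ENTER needs) could then be thin wrappers on top of
the common asm.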