On Mon, 2023-11-27 at 12:05 -0800, Dave Hansen wrote: > On 11/27/23 11:33, Huang, Kai wrote: > > On Mon, 2023-11-27 at 10:13 -0800, Hansen, Dave wrote: > > > On 11/9/23 03:55, Kai Huang wrote: > > > ... > > > > --- a/arch/x86/kernel/reboot.c > > > > +++ b/arch/x86/kernel/reboot.c > > > > @@ -31,6 +31,7 @@ > > > > #include <asm/realmode.h> > > > > #include <asm/x86_init.h> > > > > #include <asm/efi.h> > > > > +#include <asm/tdx.h> > > > > > > > > /* > > > > * Power off function, if any > > > > @@ -741,6 +742,20 @@ void native_machine_shutdown(void) > > > > local_irq_disable(); > > > > stop_other_cpus(); > > > > #endif > > > > + /* > > > > + * stop_other_cpus() has flushed all dirty cachelines of TDX > > > > + * private memory on remote cpus. Unlike SME, which does the > > > > + * cache flush on _this_ cpu in the relocate_kernel(), flush > > > > + * the cache for _this_ cpu here. This is because on the > > > > + * platforms with "partial write machine check" erratum the > > > > + * kernel needs to convert all TDX private pages back to normal > > > > + * before booting to the new kernel in kexec(), and the cache > > > > + * flush must be done before that. If the kernel took SME's way, > > > > + * it would have to muck with the relocate_kernel() assembly to > > > > + * do memory conversion. > > > > + */ > > > > + if (platform_tdx_enabled()) > > > > + native_wbinvd(); > > > > > > Why can't the TDX host code just set host_mem_enc_active=1? > > > > > > Sure, you'll end up *using* the SME WBINVD support, but then you don't > > > have two different WBINVD call sites. You also don't have to mess with > > > a single line of assembly. > > > > I wanted to avoid changing the assembly. > > > > Perhaps the comment isn't very clear. Flushing cache (on the CPU running kexec) > > when the host_mem_enc_active=1 is handled in the relocate_kernel() assembly, > > which happens at very late stage right before jumping to the new kernel. > > However for TDX when the platform has erratum we need to convert TDX private > > pages back to normal, which must be done after flushing cache. If we reuse > > host_mem_enc_active=1, then we will need to change the assembly code to do that. > > I honestly think you need to stop thinking about the partial write issue > at this point in the series. It's really causing a horrible amount of > unnecessary confusion. > > Here's the golden rule: > > The boot CPU needs to run WBINVD sometime after it stops writing > to private memory but before it starts treating the memory as > shared. > > On SME kernels, that key point evidently in early boot when it's > enabling SME. I _think_ that point is also a valid place to do WBINVD > on no-TDX-erratum systems. You mean we could advertise cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) true for TDX host? We could but IMHO it doesn't perfectly match. SME kernel sets _PAGE_ENC on by default for all memory mappings IIUC, but TDX host never actually sets any encryption bits in page tables managed by the kernel. So I think we can just do below, but not advertise CC_ATTR_HOST_MEM_ENCRYPT for TDX host? --- a/arch/x86/kernel/machine_kexec_64.c +++ b/arch/x86/kernel/machine_kexec_64.c @@ -377,7 +377,8 @@ void machine_kexec(struct kimage *image) (unsigned long)page_list, image->start, image->preserve_context, - cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT)); + cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT) || + platform_tdx_enabled()); > > On TDX systems with the erratum, there's a *second* point before the > private=>shared conversion occurs. I think what you're trying to do > here is prematurely optimize the erratum-affected affected systems so > that they don't do two WBINVDs. Please stop trying to do that. > > This WBINVD is only _needed_ for the erratum. It should be closer to > the actual erratum handing. If we do WBINVD early here then the second one isn't needed. But 100% agreed this handling/optimization should be done later closer to the erratum handling.