On Fri, 2023-01-06 at 16:35 -0800, Dave Hansen wrote: > On 12/8/22 22:52, Kai Huang wrote: > > There are two problems in terms of using kexec() to boot to a new kernel > > when the old kernel has enabled TDX: 1) Part of the memory pages are > > still TDX private pages (i.e. metadata used by the TDX module, and any > > TDX guest memory if kexec() happens when there's any TDX guest alive). > > 2) There might be dirty cachelines associated with TDX private pages. > > > > Because the hardware doesn't guarantee cache coherency among different > > KeyIDs, the old kernel needs to flush cache (of those TDX private pages) > > before booting to the new kernel. Also, reading TDX private page using > > any shared non-TDX KeyID with integrity-check enabled can trigger #MC. > > Therefore ideally, the kernel should convert all TDX private pages back > > to normal before booting to the new kernel. > > This is just talking about way too many things that just don't apply. > > Let's focus on the *ACTUAL* problem that's being addressed instead of > the 15 problems that aren't actual practical problems. Will get rid of those. > > > However, this implementation doesn't convert TDX private pages back to > > normal in kexec() because of below considerations: > > > > 1) Neither the kernel nor the TDX module has existing infrastructure to > > track which pages are TDX private pages. > > 2) The number of TDX private pages can be large, and converting all of > > them (cache flush + using MOVDIR64B to clear the page) in kexec() can > > be time consuming. > > 3) The new kernel will almost only use KeyID 0 to access memory. KeyID > > 0 doesn't support integrity-check, so it's OK. > > 4) The kernel doesn't (and may never) support MKTME. If any 3rd party > > kernel ever supports MKTME, it can/should do MOVDIR64B to clear the > > page with the new MKTME KeyID (just like TDX does) before using it. > > Yeah, why are we getting all worked up about MKTME when there is not > support? I am not sure whether we need to consider 3rd party kernel case? > > The only thing that matters here is dirty cacheline writeback. There > are two things the kernel needs to do to mitigate that: > > 1. Stop accessing TDX private memory mappings > 1a. Stop making TDX module calls (uses global private KeyID) > 1b. Stop TDX guests from running (uses per-guest KeyID) > 2. Flush any cachelines from previous private KeyID writes > > There are a couple of ways we can do #2. We do *NOT* need to convert > *ANYTHING* back to KeyID 0. Page conversion doesn't even come into play > in any way as far as I can tell. May I ask why? When I was writing this patch I was not sure whether kexec() should give the new kernel a clean slate. SGX driver doesn't EREMOVE all EPC during kexec() but depends on the new kernel to do that too, but I don't know what's the general guide of supporting kexec(). > > I think you're also saying that since all CPUs go through this path and > there is no TDX activity between the WBINVD and the native_halt() that > 1a and 1b basically happen for "free" without needing to do theme > explicitly. Yes. Should we mention this part in changelog? AMD SME kexec() support patch bba4ed011a52d ("x86/mm, kexec: Allow kexec to be used with SME") seems doesn't mention anything similar (SME and TDX may be different, though). > > > Therefore, this implementation just flushes cache to make sure there are > > no stale dirty cachelines associated with any TDX private KeyIDs before > > booting to the new kernel, otherwise they may silently corrupt the new > > kernel. > > That's fine. So, this patch kinda happens to land in the right spot > even after thrashing about about a while. > > > Following SME support, use wbinvd() to flush cache in stop_this_cpu(). > > Theoretically, cache flush is only needed when the TDX module has been > > initialized. However initializing the TDX module is done on demand at > > runtime, and it takes a mutex to read the module status. Just check > > whether TDX is enabled by BIOS instead to flush cache. > > Yeah, close enough. > > > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > > index c21b7347a26d..0cc84977dc62 100644 > > --- a/arch/x86/kernel/process.c > > +++ b/arch/x86/kernel/process.c > > @@ -765,8 +765,14 @@ void __noreturn stop_this_cpu(void *dummy) > > * > > * Test the CPUID bit directly because the machine might've cleared > > * X86_FEATURE_SME due to cmdline options. > > + * > > + * Similar to SME, if the TDX module is ever initialized, the > > + * cachelines associated with any TDX private KeyID must be flushed > > + * before transiting to the new kernel. The TDX module is initialized > > + * on demand, and it takes the mutex to read its status. Just check > > + * whether TDX is enabled by BIOS instead to flush cache. > > */ > > There's too much detail here. Let's up-level it a bit. We don't need > to be talking about TDX locking here. Sure will do. Thanks! > > /* > * The TDX module or guests might have left dirty cachelines > * behind. Flush them to avoid corruption from later writeback. > * Note that this flushes on all systems where TDX is possible, > * but does not actually check that TDX was in use. > */ > > > - if (cpuid_eax(0x8000001f) & BIT(0)) > > + if (cpuid_eax(0x8000001f) & BIT(0) || platform_tdx_enabled()) > > native_wbinvd(); > > for (;;) { > > /* > >