On Mon, 2023-06-19 at 15:16 +0200, David Hildenbrand wrote:
> On 04.06.23 16:27, Kai Huang wrote:
> > To enable TDX the kernel needs to initialize TDX from two perspectives:
> > 1) Do a set of SEAMCALLs to initialize the TDX module to make it ready
> > to create and run TDX guests; 2) Do the per-cpu initialization SEAMCALL
> > on one logical cpu before the kernel wants to make any other SEAMCALLs
> > on that cpu (including those involved during module initialization and
> > running TDX guests).
> >
> > The TDX module can be initialized only once in its lifetime.  Instead
> > of always initializing it at boot time, this implementation chooses an
> > "on demand" approach to initialize TDX until there is a real need (e.g.
> > when requested by KVM).  This approach has below pros:
> >
> > 1) It avoids consuming the memory that must be allocated by kernel and
> > given to the TDX module as metadata (~1/256th of the TDX-usable memory),
> > and also saves the CPU cycles of initializing the TDX module (and the
> > metadata) when TDX is not used at all.
> >
> > 2) The TDX module design allows it to be updated while the system is
> > running.  The update procedure shares quite a few steps with this "on
> > demand" initialization mechanism.  The hope is that much of "on demand"
> > mechanism can be shared with a future "update" mechanism.  A boot-time
> > TDX module implementation would not be able to share much code with the
> > update mechanism.
> >
> > 3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
> > code mucks with VMX enabling.  If the TDX module were to be initialized
> > separately from KVM (like at boot), the boot code would need to be
> > taught how to muck with VMX enabling and KVM would need to be taught how
> > to cope with that.  Making KVM itself responsible for TDX initialization
> > lets the rest of the kernel stay blissfully unaware of VMX.
> >
> > Similar to module initialization, also make the per-cpu initialization
> > "on demand" as it also depends on VMX being enabled.
> >
> > Add two functions, tdx_enable() and tdx_cpu_enable(), to enable the TDX
> > module and enable TDX on local cpu respectively.  For now tdx_enable()
> > is a placeholder.  The TODO list will be pared down as functionality is
> > added.
> >
> > In tdx_enable() use a state machine protected by mutex to make sure the
> > initialization will only be done once, as tdx_enable() can be called
> > multiple times (i.e. KVM module can be reloaded) and may be called
> > concurrently by other kernel components in the future.
> >
> > The per-cpu initialization on each cpu can only be done once during the
> > module's life time.  Use a per-cpu variable to track its status to make
> > sure it is only done once in tdx_cpu_enable().
> >
> > Also, a SEAMCALL to do TDX module global initialization must be done
> > once on any logical cpu before any per-cpu initialization SEAMCALL.  Do
> > it inside tdx_cpu_enable() too (if it hasn't been done).
> >
> > tdx_enable() can potentially invoke SEAMCALLs on any online cpus.  The
> > per-cpu initialization must be done before those SEAMCALLs are invoked
> > on some cpu.  To keep things simple, in tdx_cpu_enable(), always do the
> > per-cpu initialization regardless of whether the TDX module has been
> > initialized or not.  And in tdx_enable(), don't call tdx_cpu_enable()
> > but assume the caller has disabled CPU hotplug, done VMXON and
> > tdx_cpu_enable() on all online cpus before calling tdx_enable().
> >
> > Signed-off-by: Kai Huang <kai.huang@xxxxxxxxx>
> > Reviewed-by: Isaku Yamahata <isaku.yamahata@xxxxxxxxx>
> > ---
> >
> > v10 -> v11:
> >  - Return -ENODEV instead of -EINVAL when CONFIG_INTEL_TDX_HOST is off.
> >  - Return the actual error code for tdx_enable() instead of -EINVAL.
> >  - Added Isaku's Reviewed-by.
> >
> > v9 -> v10:
> >  - Merged the patch to handle per-cpu initialization to this patch to
> >    tell the story better.
> >  - Changed how to handle the per-cpu initialization to only provide a
> >    tdx_cpu_enable() function to let the user of TDX to do it when the
> >    user wants to run TDX code on a certain cpu.
> >  - Changed tdx_enable() to not call cpus_read_lock() explicitly, but
> >    call lockdep_assert_cpus_held() to assume the caller has done that.
> >  - Improved comments around tdx_enable() and tdx_cpu_enable().
> >  - Improved changelog to tell the story better accordingly.
> >
> > v8 -> v9:
> >  - Removed detailed TODO list in the changelog (Dave).
> >  - Added back steps to do module global initialization and per-cpu
> >    initialization in the TODO list comment.
> >  - Moved the 'enum tdx_module_status_t' from tdx.c to local tdx.h
> >
> > v7 -> v8:
> >  - Refined changelog (Dave).
> >  - Removed "all BIOS-enabled cpus" related code (Peter/Thomas/Dave).
> >  - Add a "TODO list" comment in init_tdx_module() to list all steps of
> >    initializing the TDX Module to tell the story (Dave).
> >  - Made tdx_enable() universally return -EINVAL, and removed nonsense
> >    comments (Dave).
> >  - Simplified __tdx_enable() to only handle success or failure.
> >  - TDX_MODULE_SHUTDOWN -> TDX_MODULE_ERROR
> >  - Removed TDX_MODULE_NONE (not loaded) as it is not necessary.
> >  - Improved comments (Dave).
> >  - Pointed out 'tdx_module_status' is software thing (Dave).
> >
> > v6 -> v7:
> >  - No change.
> >
> > v5 -> v6:
> >  - Added code to set status to TDX_MODULE_NONE if TDX module is not
> >    loaded (Chao)
> >  - Added Chao's Reviewed-by.
> >  - Improved comments around cpus_read_lock().
> >
> > - v3->v5 (no feedback on v4):
> >  - Removed the check that SEAMRR and TDX KeyID have been detected on
> >    all present cpus.
> >  - Removed tdx_detect().
> >  - Added num_online_cpus() to MADT-enabled CPUs check within the CPU
> >    hotplug lock and return early with error message.
> >  - Improved dmesg printing for TDX module detection and initialization.
> >
> >
> > ---
> >   arch/x86/include/asm/tdx.h  |   4 +
> >   arch/x86/virt/vmx/tdx/tdx.c | 179 ++++++++++++++++++++++++++++++++++++
> >   arch/x86/virt/vmx/tdx/tdx.h |  13 +++
> >   3 files changed, 196 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index b489b5b9de5d..03f74851608f 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -102,8 +102,12 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >  
> >  #ifdef CONFIG_INTEL_TDX_HOST
> >  bool platform_tdx_enabled(void);
> > +int tdx_cpu_enable(void);
> > +int tdx_enable(void);
> >  #else	/* !CONFIG_INTEL_TDX_HOST */
> >  static inline bool platform_tdx_enabled(void) { return false; }
> > +static inline int tdx_cpu_enable(void) { return -ENODEV; }
> > +static inline int tdx_enable(void)  { return -ENODEV; }
> >  #endif	/* CONFIG_INTEL_TDX_HOST */
> >  
> >  #endif /* !__ASSEMBLY__ */
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index e62e978eba1b..bcf2b2d15a2e 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -13,6 +13,10 @@
> >  #include <linux/errno.h>
> >  #include <linux/printk.h>
> >  #include <linux/smp.h>
> > +#include <linux/cpu.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/percpu-defs.h>
> > +#include <linux/mutex.h>
> >  #include <asm/msr-index.h>
> >  #include <asm/msr.h>
> >  #include <asm/archrandom.h>
> > @@ -23,6 +27,18 @@ static u32 tdx_global_keyid __ro_after_init;
> >  static u32 tdx_guest_keyid_start __ro_after_init;
> >  static u32 tdx_nr_guest_keyids __ro_after_init;
> >  
> > +static unsigned int tdx_global_init_status;
> > +static DEFINE_RAW_SPINLOCK(tdx_global_init_lock);
> > +#define TDX_GLOBAL_INIT_DONE	_BITUL(0)
> > +#define TDX_GLOBAL_INIT_FAILED	_BITUL(1)
> > +
> > +static DEFINE_PER_CPU(unsigned int, tdx_lp_init_status);
> > +#define TDX_LP_INIT_DONE	_BITUL(0)
> > +#define TDX_LP_INIT_FAILED	_BITUL(1)
>
> I'm curious, why do we have to track three states: uninitialized
> (!done), initialized (done + !failed), permanent error (done + failed).
>
> [besides: why can't you use an enum and share that between global and pcpu?]
>
> Why can't you have a pcpu "bool tdx_lp_initialized" and "bool
> tdx_global_initialized"?
>
> I mean, if there was an error during previous initialization, it's not
> initialized: you'd try initializing again -- and possibly fail again --
> on the next attempt. I doubt that a "try to cache failed status to keep
> failing fast" is really required.
>
> Is there any other reason (e.g., second init attempt would set your
> computer on fire) why it can't be simpler?

No other reason, only the one you mentioned above: I didn't want to retry
after a permanent error.

Yes, I agree we can use a per-cpu "bool tdx_lp_initialized" and a
"bool tdx_global_initialized" to simplify the logic.

Thanks!
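
To make the simplification concrete, here is a rough user-space model of the
two-bool tracking.  This is only a sketch, not the actual patch: the
model_*() functions, NR_MODEL_CPUS, the counters and fail_next_global_init
are hypothetical stand-ins (a plain array replaces DEFINE_PER_CPU, and the
stubs replace the real SEAMCALLs), and all locking is left out.

```c
#include <errno.h>
#include <stdbool.h>

#define NR_MODEL_CPUS 4			/* hypothetical cpu count for the model */

/* Replaces the DONE/FAILED bit pairs: a failed attempt leaves these false. */
static bool tdx_global_initialized;
static bool tdx_lp_initialized[NR_MODEL_CPUS];	/* stand-in for a per-cpu bool */

/* Instrumentation for the model only. */
static int nr_global_init_calls;
static int nr_lp_init_calls;
static bool fail_next_global_init;	/* makes the next global init fail once */

/* Stub for the global initialization SEAMCALL (hypothetical). */
static int model_seamcall_global_init(void)
{
	nr_global_init_calls++;
	if (fail_next_global_init) {
		fail_next_global_init = false;
		return -EIO;
	}
	return 0;
}

/* Stub for the per-cpu initialization SEAMCALL (hypothetical). */
static int model_seamcall_lp_init(int cpu)
{
	(void)cpu;
	nr_lp_init_calls++;
	return 0;
}

/* Do the global init once; on failure nothing is cached, so callers retry. */
static int model_try_init_module_global(void)
{
	int ret;

	if (tdx_global_initialized)
		return 0;

	ret = model_seamcall_global_init();
	if (!ret)
		tdx_global_initialized = true;
	return ret;
}

/* Model of the per-cpu enable path: global init first, then this cpu. */
int model_tdx_cpu_enable(int cpu)
{
	int ret;

	if (cpu < 0 || cpu >= NR_MODEL_CPUS)
		return -EINVAL;

	if (tdx_lp_initialized[cpu])
		return 0;

	ret = model_try_init_module_global();
	if (ret)
		return ret;

	ret = model_seamcall_lp_init(cpu);
	if (!ret)
		tdx_lp_initialized[cpu] = true;
	return ret;
}
```

With this shape an error simply leaves the flag false, so the next call
retries (and may fail again) instead of returning a cached failure.  The
real code would still need the raw spinlock around the global
initialization.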