On 3/13/23 18:50, Huang, Kai wrote: > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote: >> On Sun, Mar 12, 2023 at 11:08:44PM +0000, >> "Huang, Kai" <kai.huang@xxxxxxxxx> wrote: >> >>> On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote: >>>>> + >>>>> +static int try_init_module_global(void) >>>>> +{ >>>>> + int ret; >>>>> + >>>>> + /* >>>>> + * The TDX module global initialization only needs to be done >>>>> + * once on any cpu. >>>>> + */ >>>>> + spin_lock(&tdx_global_init_lock); >>>>> + >>>>> + if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) { >>>>> + ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ? >>>>> + -EINVAL : 0; >>>>> + goto out; >>>>> + } >>>>> + >>>>> + /* All '0's are just unused parameters. */ >>>>> + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); >>>>> + >>>>> + tdx_global_init_status = TDX_GLOBAL_INIT_DONE; >>>>> + if (ret) >>>>> + tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED; >>>> >>>> If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY. >>>> In such case, we should allow the caller to retry or make this function retry >>>> instead of marking error stickily. >>> >>> The spec says: >>> >>> TDX_SYS_BUSY The operation was invoked when another TDX module >>> operation was in progress. The operation may be retried. >>> >>> So I don't see how entropy is lacking is related to this error. Perhaps you >>> were mixing up with KEY.CONFIG? >> >> TDH.SYS.INIT() initializes global canary value. TDX module is compiled with >> strong stack protector enabled by clang and canary value needs to be >> initialized. By default, the canary value is stored at >> %fsbase:<STACK_CANARY_OFFSET 0x28> >> >> Although this is a job for libc or language runtime, TDX modules has to do it >> itself because it's stand alone. >> >> From tdh_sys_init.c >> _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void) >> { >> ia32_rflags_t rflags = {.raw = 0}; >> uint64_t canary; >> if (!ia32_rdrand(&rflags, &canary)) >> { >> return TDX_SYS_BUSY; >> } >> ... >> last_page_ptr->stack_canary.canary = canary; >> >> > > Then it is a hidden behaviour of the TDX module that is not reflected in the > spec. This is true. Could you please go ask the TDX module folks to fix this up? > I am not sure whether we should handle because: > > 1) This is an extremely rare case. Kernel would be basically under attack if > such error happened. In the current series we don't handle such case in > KEY.CONFIG either but just leave a comment (see patch 13). Rare, yes. Under attack? I'm not sure where you get that from. Look at the SDM: > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand > of random numbers by software processes/threads to exceed the rate at which the random number generator > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND > instruction indicates the occurrence of this rare situation by clearing the CF flag. That doesn't talk about attacks. > 2) Not sure whether this will be changed in the future. > > So I think we should keep as is. TDX_SYS_BUSY really is missing some nuance. You *REALLY* want to retry RDRAND failures. But, if you have VMM locking and don't expect two users calling into the TDX module then TDX_SYS_BUSY from a busy *module* is a bad (and probably fatal) signal. I suspect we should just throw a few retries in the seamcall() infrastructure to retry in the case of TDX_SYS_BUSY. It'll take care of RDRAND failures. If a retry loop fails to resolve it, then we should probably dump a warning and return an error. Just do this once, in common code.