On Wed, 2023-03-15 at 11:10 +0000, Huang, Kai wrote: > On Tue, 2023-03-14 at 08:48 -0700, Dave Hansen wrote: > > On 3/13/23 18:50, Huang, Kai wrote: > > > On Mon, 2023-03-13 at 16:49 -0700, Isaku Yamahata wrote: > > > > On Sun, Mar 12, 2023 at 11:08:44PM +0000, > > > > "Huang, Kai" <kai.huang@xxxxxxxxx> wrote: > > > > > > > > > On Wed, 2023-03-08 at 14:27 -0800, Isaku Yamahata wrote: > > > > > > > + > > > > > > > +static int try_init_module_global(void) > > > > > > > +{ > > > > > > > + int ret; > > > > > > > + > > > > > > > + /* > > > > > > > + * The TDX module global initialization only needs to be done > > > > > > > + * once on any cpu. > > > > > > > + */ > > > > > > > + spin_lock(&tdx_global_init_lock); > > > > > > > + > > > > > > > + if (tdx_global_init_status & TDX_GLOBAL_INIT_DONE) { > > > > > > > + ret = tdx_global_init_status & TDX_GLOBAL_INIT_FAILED ? > > > > > > > + -EINVAL : 0; > > > > > > > + goto out; > > > > > > > + } > > > > > > > + > > > > > > > + /* All '0's are just unused parameters. */ > > > > > > > + ret = seamcall(TDH_SYS_INIT, 0, 0, 0, 0, NULL, NULL); > > > > > > > + > > > > > > > + tdx_global_init_status = TDX_GLOBAL_INIT_DONE; > > > > > > > + if (ret) > > > > > > > + tdx_global_init_status |= TDX_GLOBAL_INIT_FAILED; > > > > > > > > > > > > If entropy is lacking (rdrand failure), TDH_SYS_INIT can return TDX_SYS_BUSY. > > > > > > In such case, we should allow the caller to retry or make this function retry > > > > > > instead of marking error stickily. > > > > > > > > > > The spec says: > > > > > > > > > > TDX_SYS_BUSY The operation was invoked when another TDX module > > > > > operation was in progress. The operation may be retried. > > > > > > > > > > So I don't see how entropy is lacking is related to this error. Perhaps you > > > > > were mixing up with KEY.CONFIG? > > > > > > > > TDH.SYS.INIT() initializes global canary value. TDX module is compiled with > > > > strong stack protector enabled by clang and canary value needs to be > > > > initialized. By default, the canary value is stored at > > > > %fsbase:<STACK_CANARY_OFFSET 0x28> > > > > > > > > Although this is a job for libc or language runtime, TDX modules has to do it > > > > itself because it's stand alone. > > > > > > > > From tdh_sys_init.c > > > > _STATIC_INLINE_ api_error_type tdx_init_stack_canary(void) > > > > { > > > > ia32_rflags_t rflags = {.raw = 0}; > > > > uint64_t canary; > > > > if (!ia32_rdrand(&rflags, &canary)) > > > > { > > > > return TDX_SYS_BUSY; > > > > } > > > > ... > > > > last_page_ptr->stack_canary.canary = canary; > > > > > > > > > > > > > > Then it is a hidden behaviour of the TDX module that is not reflected in the > > > spec. > > > > This is true. Could you please go ask the TDX module folks to fix this up? > > Sure will do. > > To make sure, you mean we should ask TDX module guys to add the new > TDX_RND_NO_ENTROPY error code to TDX module 1.0? > > "another TDX module operation was in progress" and "running out of entropy" are > different thing and should not be mixed together IMHO. > > > > > > I am not sure whether we should handle because: > > > > > > 1) This is an extremely rare case. Kernel would be basically under attack if > > > such error happened. In the current series we don't handle such case in > > > KEY.CONFIG either but just leave a comment (see patch 13). > > > > Rare, yes. Under attack? I'm not sure where you get that from. Look > > at the SDM: > > > > > Under heavy load, with multiple cores executing RDRAND in parallel, it is possible, though unlikely, for the demand > > > of random numbers by software processes/threads to exceed the rate at which the random number generator > > > hardware can supply them. This will lead to the RDRAND instruction returning no data transitorily. The RDRAND > > > instruction indicates the occurrence of this rare situation by clearing the CF flag. > > > > That doesn't talk about attacks. > > Thanks for citing the documentation. I checked the kernel code before and it > seems currently there's no code to call RDRAND very frequently. But yes we > should not say "under attack". I have some old memory that someone said so > (maybe me?). > > > > > > 2) Not sure whether this will be changed in the future. > > > > > > So I think we should keep as is. > > > > TDX_SYS_BUSY really is missing some nuance. You *REALLY* want to retry > > RDRAND failures. > > > > OK. Agreed. Then I think the TDH.SYS.KEY.CONFIG should retry when running out > of entropy too. > > > But, if you have VMM locking and don't expect two > > users calling into the TDX module then TDX_SYS_BUSY from a busy *module* > > is a bad (and probably fatal) signal. > > Yes we have a lock to protect TDH.SYS.INIT from being called in parallel. W/o > this entropy thing TDX_SYS_BUSY should never happen. > > > > > I suspect we should just throw a few retries in the seamcall() > > infrastructure to retry in the case of TDX_SYS_BUSY. It'll take care of > > RDRAND failures. If a retry loop fails to resolve it, then we should > > probably dump a warning and return an error. > > > > Just do this once, in common code. > > I can do. Just want to make sure do you want to retry TDX_SYS_BUSY, or retry > TDX_RND_NO_ENTROPY (if we want to ask TDX module guys to change to return this > value)? > > Also, even we retry either TDX_SYS_BUSY or TDX_RND_NO_ENTROPY in common > seamcall() code, it doesn't handle the TDH.SYS.KEY.CONFIG, because sadly this > SEAMCALL returns a different error code: > > TDX_KEY_GENERATION_FAILED Failed to generate a random key. This is > typically caused by an entropy error of the > CPU's random number generator, and may > be impacted by RDSEED, RDRAND or PCONFIG > executing on other LPs. The operation should be > retried. > Hi Dave, Sorry to ping. Could you help to check whether my understanding is aligned with what you suggested?