On Tue, Feb 4, 2025 at 9:32 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > I think this is the right fix for _now_. In practice, Vishal's problem > only occurs on CONFIG_PARAVIRT_XXL systems. His proposed fix here does > not make TDX depend on CONFIG_PARAVIRT_XXL, it just provides an extra > override when TDX and CONFIG_PARAVIRT_XXL collide. > > This seems like a reasonable compromise that avoids entangling > PARAVIRT_XXL and TDX _too_ much and also avoids reinventing a hunk of > PARAVIRT_XXL just to fix this bug. To ensure that we spend a bit more time here, would folks be ok with making TDX depend on CONFIG_PARAVIRT_XXL as a stopgap until we have the long term proposal Dave mentioned below to cleanly separate "pv_ops.irq.safe_halt()" from paravirt infra? > > Long-term, I think it would be nice to move pv_ops.irq.safe_halt() away > from being a paravirt thing and move it over to a plain static_call(). > > Then, TDX can get rid of this hunk: > > pr_info("using TDX aware idle routine\n"); > static_call_update(x86_idle, tdx_safe_halt); > > and move back to default_idle() which could look like this: > > void __cpuidle default_idle(void) > { > - raw_safe_halt(); > + static_call(x86_safe_halt)(); > raw_local_irq_disable(); > } > > If 'x86_safe_halt' was the only route in the kernel to call 'sti;hlt' > then we can know with pretty high confidence if TDX or Xen code sets > their own 'x86_safe_halt' that they won't run into more bugs like this one. > > On to the patch itself... > > On 1/29/25 15:25, Vishal Annapurve wrote: > > Direct HLT instruction execution causes #VEs for TDX VMs which is routed > > to hypervisor via tdvmcall. This process renders HLT instruction > > execution inatomic, so any preceding instructions like STI/MOV SS will > > end up enabling interrupts before the HLT instruction is routed to the > > hypervisor. This creates scenarios where interrupts could land during > > HLT instruction emulation without aborting halt operation leading to > > idefinite halt wait times. > > Vishal! I'm noticing spelling issues right up front and center here, > just like in v1. I think I asked nicely last time if you could start > spell checking your changelogs before posting v2. Any chance you could > actually put some spell checking in place before v3? Yeah, will do. I incorrectly thought that codespell runs by default with checkpatch.pl. > > So, the x86 STI-shadow mechanism has left a trail of tears. We don't > want to explain the whole sordid tale here, but I don't feel like > talking about the "what" (atomic vs. inatomic execution) without > explaining "why" is really sufficient to explain the problem at hand. > > Sean had a pretty concise description in here that I liked: > > https://lore.kernel.org/all/Z5l6L3Hen9_Y3SGC@xxxxxxxxxx/ > > But the net result is that it is currently unsafe for TDX guests to use > the "sti;hlt" sequence. It's really important to say *that* somewhere. > Ack. > > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") already > > upgraded x86_idle() to invoke tdvmcall to avoid such scenarios, but > > it didn't cover pv_native_safe_halt() which can be invoked using > > raw_safe_halt() from call sites like acpi_safe_halt(). > > Does this convey the same thing? > > Commit bfe6ed0c6727 ("x86/tdx: Add HLT support for TDX guests") > prevented the idle routines from using "sti;hlt". But it missed the > paravirt routine. That can be reached like this, for example: > > acpi_safe_halt() => > raw_safe_halt() => > arch_safe_halt() => > irq.safe_halt() => > pv_native_safe_halt() > > I also dislike the "upgrade" nomenclature. It's not really an "upgrade". Ack. > > ... > > @@ -380,13 +381,18 @@ static int handle_halt(struct ve_info *ve) > > { > > const bool irq_disabled = irqs_disabled(); > > > > + if (!irq_disabled) { > > + WARN_ONCE(1, "HLT instruction emulation unsafe with irqs enabled\n"); > > + return -EIO; > > + } > > The warning is fine, but I do think it should be separated from the bug fix. > > > > > -void __cpuidle tdx_safe_halt(void) > > +void __cpuidle tdx_idle(void) > > { > > const bool irq_disabled = false; > > > > @@ -397,6 +403,12 @@ void __cpuidle tdx_safe_halt(void) > > WARN_ONCE(1, "HLT instruction emulation failed\n"); > > } > > > > +static void __cpuidle tdx_safe_halt(void) > > +{ > > + tdx_idle(); > > + raw_local_irq_enable(); > > +} > > The naming here is a bit wonky. Think of how the call chain will look: > > irq.safe_halt() => > tdx_safe_halt() => > tdx_idle() => > __halt() > > See how it's doing a more and more TDX-specific halt operation? Isn't > the "idle" call right in the middle confusing? Makes sense. > > > static int read_msr(struct pt_regs *regs, struct ve_info *ve) > > { > > struct tdx_module_args args = { > > @@ -1083,6 +1095,15 @@ void __init tdx_early_init(void) > > x86_platform.guest.enc_kexec_begin = tdx_kexec_begin; > > x86_platform.guest.enc_kexec_finish = tdx_kexec_finish; > > > > +#ifdef CONFIG_PARAVIRT_XXL > > + /* > > + * halt instruction execution is not atomic for TDX VMs as it generates > > + * #VEs, so otherwise "safe" halt invocations which cause interrupts to > > + * get enabled right after halt instruction don't work for TDX VMs. > > + */ > > + pv_ops.irq.safe_halt = tdx_safe_halt; > > +#endif > > Just like the changelog, it's hard to write a good comment without going > into the horrors of the STI-shadow. But I think this is a bit more to > the point: > > /* > * Avoid the literal hlt instruction in TDX guests. hlt will > * induce a #VE in the STI-shadow which will enable interrupts > * in a place where they are not wanted. > */ > > Ack, will take care of these comments in v3.