* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> * David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> > But wait, why did I say "mostly"? Well, not everyone has a retpoline
> > compiler yet... but OK, screw them; they need to update.
> >
> > Then there's Skylake, and that generation of CPU cores. For complicated
> > reasons they actually end up being vulnerable not just on indirect
> > branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
> > in a deep chain).
> >
> > The IBRS solution, ugly though it is, did address that. Retpoline
> > doesn't. There are patches being floated to detect and prevent deep
> > stacks, and deal with some of the other special cases that bite on SKL,
> > but those are icky too. And in fact IBRS performance isn't anywhere
> > near as bad on this generation of CPUs as it is on earlier CPUs
> > *anyway*, which makes it not quite so insane to *contemplate* using it
> > as Intel proposed.
>
> There's another possible method to avoid deep stacks on Skylake, without compiler
> support:
>
>   - Use the existing mcount based function tracing live patching machinery
>     (CONFIG_FUNCTION_TRACER=y) to install a _very_ fast and simple stack depth
>     tracking tracer which would issue a retpoline when stack depth crosses
>     boundaries of ~16 entries.

The patch below demonstrates the principle, it forcibly enables dynamic ftrace
patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:

  ffffffff81a01a40 <__fentry__>:
  ffffffff81a01a40:       c3                      retq

This would have to be extended with (very simple) call stack depth tracking (just
3 more instructions would do in the fast path I believe) and a suitable SkyLake
workaround (and also has to play nice with the ftrace callbacks).

On non-SkyLake the overhead would be 0 cycles.

On SkyLake this would add an overhead of maybe 2-3 cycles per function call and
obviously all this code and data would be very cache hot.
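
To make the depth-tracking idea a bit more concrete, here is a rough user-space C
model of the logic only. It is not kernel code and not part of the patch. All
names in it (RSB_DEPTH, call_depth, skl_mitigate(), depth_enter(), depth_exit(),
run_chain()) are hypothetical, the boundary of 16 is taken from the "~16 entries"
figure above, and the exit hook is an assumption: mcount/__fentry__ only fires on
function entry, so a real implementation would need some other way to notice
returns.

```c
#include <assert.h>

/* Assumed boundary, taken from the "~16 entries" figure in the mail. */
#define RSB_DEPTH 16

static unsigned int call_depth;   /* current nesting depth (hypothetical) */
static unsigned int mitigations;  /* how many times the slow path ran */

/* Stand-in for whatever the real SkyLake workaround would be. */
static void skl_mitigate(void)
{
	mitigations++;
}

/* Fast path on function entry: bump the counter, test for a boundary. */
static inline void depth_enter(void)
{
	if (++call_depth % RSB_DEPTH == 0)
		skl_mitigate();
}

/* Assumed exit hook; the entry-only mcount call cannot do this itself. */
static inline void depth_exit(void)
{
	assert(call_depth > 0);
	call_depth--;
}

/* Simulate a call chain that nests 'levels' deep (levels >= 1). */
static void nested(unsigned int levels)
{
	depth_enter();
	if (levels > 1)
		nested(levels - 1);
	depth_exit();
}

/* Run one chain from depth 0 and report how often the slow path fired. */
unsigned int run_chain(unsigned int levels)
{
	call_depth = 0;
	mitigations = 0;
	nested(levels);
	return mitigations;
}
```

In this model a chain of about a dozen calls never reaches the slow path, while
run_chain(40) takes it twice, at depths 16 and 32. Whether firing only on the
boundary is sufficient for the actual vulnerability is exactly the open question.
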
Given that the average number of function calls per system call is around a
dozen, this would be _much_ faster than any microcode/MSR based approach.

Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run?

Is there a description of the exact speculative execution vulnerability that has
to be addressed to begin with?

If this approach is workable I'd much prefer it to any MSR writes in the syscall
entry path not just because it's fast enough in practice to not be turned off by
everyone, but also because everyone would agree that per function call overhead
needs to go away on new CPUs.

Both deployment and backporting are also _much_ more flexible, simpler, faster
and more complete than microcode/firmware or compiler based solutions.

Assuming the vulnerability can be addressed via this route that is, which is a
big assumption!

Thanks,

	Ingo

 arch/x86/Kconfig            | 3 +++
 arch/x86/kernel/ftrace_64.S | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..df471538a79c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,8 @@ config X86
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select DYNAMIC_FTRACE
+	select DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EBPF_JIT			if X86_64
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EXIT_THREAD
@@ -140,6 +142,7 @@ config X86
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
+	select FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS
 	select HAVE_HW_BREAKPOINT
 	select HAVE_IDE
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 7cb8ba08beb9..1e219e0f2887 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -19,6 +19,7 @@ EXPORT_SYMBOL(__fentry__)
 # define function_hook	mcount
 EXPORT_SYMBOL(mcount)
 #endif
+	ret
 
 /* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER