Re: [tip: x86/bugs] x86/retpoline: Ensure default return thunk isn't used at runtime

Borislav Petkov <bp@xxxxxxxxx> · Wed, 18 Oct 2023 19:55:31 +0200

On Wed, Oct 18, 2023 at 08:54:33AM -0700, Josh Poimboeuf wrote:
> On Wed, Oct 18, 2023 at 05:12:45PM +0200, Borislav Petkov wrote:
> > On Wed, Oct 18, 2023 at 03:38:56PM +0200, Ingo Molnar wrote:
> > > If then WARN_ONCE().
> > 
> > WARN_ONCE() is not enough considering that if this fires, it means we're
> > not really properly protected against one of those RET-speculation
> > things.
> > 
> > It needs to be warning constantly but then still allow booting. I.e,
> > a ratelimited warn of sorts but I don't think we have that... yet.
> 
> I'm not sure a rate-limited WARN() would be a good thing.  Either the
> user is regularly checking dmesg (most likely in some automated fashion)
> or they're not.  If the latter, a rate-limited WARN() would wrap dmesg
> pretty quickly.

Well, freezing the box without any mention about why it happens is not
viable either. So for lack of a better solution, overflowing dmesg is
all we could do.

And, on a related note, I'm thinking I should revert:

  e92626af3234 ("x86/retpoline: Remove .text..__x86.return_thunk section")

after all because I'm debugging another similar issue reported by
dhowells.

And I can reproduce it on linux-next with his config and gcc-13. The
splat looks like this below - and mind you, that's in a VM. On baremetal
you get to see only the first warning and output stops.

And that happens because for whatever reason apply_returns() can't find
that last jmp __x86_return_thunk for %r15 and it barfs.

When I revert e92626af3234, it is fixed. It fixes dhowells' box too.

Which means, IMHO, objtool is missing to add a return return call site
at the end of that __x86_indirect_thunk_r15.

And considering how close we are to the merge window, I'd let that
.text..__x86.return_thunk section exist so that objtool can find the
return sites more reliably that what we currently have.

We can always do e92626af3234 later, when it has seen more testing.

Now, to the UD2 case - look below at "* first splat".

Stack protector fires but there's no #UD exception. Well, there is, it
is well hidden:

(gdb) bt
#0  delay_tsc (cycles=3670543) at arch/x86/lib/delay.c:90
#1  0xffffffff810c706e in panic (fmt=fmt@entry=0xffffffff82504fe4 "stack-protector: Kernel stack is corrupted in: %pB")
    at kernel/panic.c:456
#2  0xffffffff81d64afb in __stack_chk_fail () at kernel/panic.c:763
#3  0xffffffff810e9333 in notify_die (val=val@entry=DIE_TRAP, str=str@entry=0xffffffff824f49b8 "invalid opcode", 
    regs=regs@entry=0xffff8880794000a8, err=err@entry=0, trap=trap@entry=6, sig=sig@entry=4) at kernel/notifier.c:597
#4  0xffffffff8101d4fb in do_error_trap (regs=regs@entry=0xffff8880794000a8, error_code=error_code@entry=0, 
    str=str@entry=0xffffffff824f49b8 "invalid opcode", trapnr=trapnr@entry=6, signr=signr@entry=4, sicode=sicode@entry=2, 
    addr=0xffffffff81d712a0 <__x86_return_thunk>) at arch/x86/kernel/traps.c:170
#5  0xffffffff81d62355 in handle_invalid_op (regs=0xffff8880794000a8) at ./arch/x86/include/asm/ptrace.h:209a
^^^^^^^^^^^^^^^^

#6  exc_invalid_op (regs=0xffff8880794000a8) at arch/x86/kernel/traps.c:263
#7  0xffffffff81e00a96 in asm_exc_invalid_op () at ./arch/x86/include/asm/idtentry.h:568
#8  0xffffffff81d49ff4 in cmp_ex_sort (a=0x38020f, b=0x761d61d8b) at lib/extable.c:61
#9  0x000000000000000c in fixed_percpu_data ()
#10 0xffff888079400198 in ?? ()
#11 0xffffffff826cdc70 in __modver_attr ()
#12 0x0000000000000270 in ?? ()
#13 0xffffffff826ceb10 in __start___ex_table ()
#14 0x0000000000000000 in ?? ()
(gdb)

*while* it runs the #UD exception handler, stackprotector determines
that the stack has been corrupted, leading to that panic.

And nothing in dmesg tells the user what's really going on.

And with the warn, you can actually see it:

------------[ cut here ]------------
Unconverted return thunk
WARNING: CPU: 0 PID: 1 at arch/x86/kernel/cpu/bugs.c:2855 check_thunks+0x11/0x1a
Modules linked in:
...
----

I still need to figure out, though, how to make check_thunks *not* have a "jmp
__x86_return_thunk" at the end itself because it gets loopy. :)

* first splat
-------------
...
x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
------------[ cut here ]------------
missing return thunk: __x86_indirect_thunk_r15+0xa/0x5f-0x0: eb 74 66 66 2e
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:755 apply_returns+0xca/0x247
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-rc6-next-20231018-build3 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:apply_returns+0xca/0x247
Code: 80 3d bd e1 aa 01 00 75 b4 49 89 d8 48 89 ea 48 89 de c6 05 ab e1 aa 01 01 b9 05 00 00 00 48 c7 c7 45 65 4f 82 e8 0b 10 0a 00 <0f> 0b eb 8f f6 05 36 2e 1f 02 02 74 26 0f b6 54 24 52 48 89 de 48
RSP: 0000:ffffffff82803e30 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff81d7122a RCX: 0000000000000003
RDX: 0000000000000086 RSI: 00000000fff7ffff RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff82803969 R12: ffffffff831fed00
R13: 0000000000000005 R14: ffffffff831fed18 R15: 0000000000013af0
FS:  0000000000000000(0000) GS:ffff888079400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888003401000 CR3: 0000000002854000 CR4: 00000000001506f0
Call Trace:
 <TASK>
 ? __warn+0x8c/0xf6
 ? report_bug+0xbf/0x11f
 ? apply_returns+0xca/0x247
 ? handle_bug+0x3c/0x63
 ? exc_invalid_op+0x13/0x5d
 ? asm_exc_invalid_op+0x16/0x20
 ? __x86_indirect_thunk_r15+0xa/0x5f
 ? apply_returns+0xca/0x247
 ? __x86_indirect_thunk_r15+0xa/0x5f
 ? __x86_indirect_thunk_r15+0x19/0x5f
 ? __x86_indirect_thunk_r15+0xc/0x5f
 alternative_instructions+0x35/0xe2
 arch_cpu_finalize_init+0xba/0xdb
 start_kernel+0x4a1/0x524
 x86_64_start_reservations+0x25/0x25
 x86_64_start_kernel+0x73/0x73
 secondary_startup_64_no_verify+0x166/0x16b
 </TASK>
---[ end trace 0000000000000000 ]---
Freeing SMP alternatives memory: 36K
pid_max: default: 32768 minimum: 301
LSM: initializing lsm=capability,yama,selinux
Yama: becoming mindful.
SELinux:  Initializing.
Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
smpboot: CPU0: Intel Core Processor (Haswell, no TSX) (family: 0x6, model: 0x3c, stepping: 0x1)
RCU Tasks Rude: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
RCU Tasks Trace: Setting shift to 2 and lim to 1 rcu_task_cb_adjust=1.
Performance Events: unsupported p6 CPU model 60 no PMU driver, software events only.
signal: max sigframe size: 1776
rcu: Hierarchical SRCU implementation.
rcu: 	Max phase no-delay instances is 1000.
NMI watchdog: Perf NMI watchdog permanently disabled
smp: Bringing up secondary CPUs ...
BUG: spinlock bad magic on CPU#0, swapper/0/1
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: notify_die+0x52/0x5b
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.6.0-rc6-next-20231018-build3 #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: notify_die+0x52/0x5b ]---

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette