Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description

Szabolcs Nagy <szabolcs.nagy@xxxxxxx> · Wed, 1 Mar 2023 14:21:41 +0000

The 02/27/2023 14:29, Rick Edgecombe wrote:
> +Application Enabling
> +====================
> +
> +An application's CET capability is marked in its ELF note and can be verified
> +from readelf/llvm-readelf output::
> +
> +    readelf -n <application> | grep -a SHSTK
> +        properties: x86 feature: SHSTK
> +
> +The kernel does not process these applications markers directly. Applications
> +or loaders must enable CET features using the interface described in section 4.
> +Typically this would be done in dynamic loader or static runtime objects, as is
> +the case in GLIBC.

Note that this has to be an early decision in libc (ld.so or static
exe start code), which will be difficult to hook into system wide
security policy settings. (e.g. to force shstk on marked binaries.)

>From userspace POV I'd prefer if a static exe did not have to parse
its own ELF notes (i.e. kernel enabled shstk based on the marking).
But I realize if there is a need for complex shstk enable/disable
decision that is better in userspace and if the kernel decision can
be overridden then it might as well all be in userspace.

> +Enabling arch_prctl()'s
> +=======================
> +
> +Elf features should be enabled by the loader using the below arch_prctl's. They
> +are only supported in 64 bit user applications.
> +
> +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> +    Enable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> +    Disable a single feature specified in 'feature'. Can only operate on
> +    one feature at a time.
> +
> +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> +    Lock in features at their current enabled or disabled status. 'features'
> +    is a mask of all features to lock. All bits set are processed, unset bits
> +    are ignored. The mask is ORed with the existing value. So any feature bits
> +    set here cannot be enabled or disabled afterwards.

The multi-thread behaviour should be documented here: Only the
current thread is affected. So an application can only change the
setting while single-threaded which is only guaranteed before any
user code is executed. Later using the prctl is complicated and
most c runtimes would not want to do that (async signalling all
threads and prctl from the handler).

In particular these interfaces are not suitable to turn shstk off
at dlopen time when an unmarked binary is loaded. Or any other
late shstk policy change will not work, so as far as i can see
the "permissive" mode in glibc does not work.

Does the main thread have shadow stack allocated before shstk is
enabled? is the shadow stack freed when it is disabled? (e.g.
what would the instruction reading the SSP do in disabled state?)

> +Proc Status
> +===========
> +To check if an application is actually running with shadow stack, the
> +user can read the /proc/$PID/status. It will report "wrss" or "shstk"
> +depending on what is enabled. The lines look like this::
> +
> +    x86_Thread_features: shstk wrss
> +    x86_Thread_features_locked: shstk wrss

Presumaly /proc/$TID/status and /proc/$PID/task/$TID/status also
shows the setting and only valid for the specific thread (not the
entire process). So i would note that this for one thread only.

> +Implementation of the Shadow Stack
> +==================================
> +
> +Shadow Stack Size
> +-----------------
> +
> +A task's shadow stack is allocated from memory to a fixed size of
> +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to
> +the maximum size of the normal stack, but capped to 4 GB. However,
> +a compat-mode application's address space is smaller, each of its thread's
> +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).

This policy tries to handle all threads with the same shadow stack
size logic, which has limitations. I think it should be improved
(otherwise some applications will have to turn shstk off):

- RLIMIT_STACK is not an upper bound for the main thread stack size
  (rlimit can increase/decrease dynamically).
- RLIMIT_STACK only applies to the main thread, so it is not an upper
  bound for non-main thread stacks.
- i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
  stack can overflow.
- stack size << startup RLIMIT_STACK is also possible and then VA
  space is wasted (can lead to OOM with strict memory overcommit).
- clone3 tells the kernel the thread stack size so that should be
  used instead of RLIMIT_STACK. (clone does not though.)
- I think it's better to have a new limit specifically for shadow
  stack size (which by default can be RLIMIT_STACK) so userspace
  can adjust it if needed (another reason is that stack size is
  not always a good indicator of max call depth).

> +Signal
> +------
> +
> +By default, the main program and its signal handlers use the same shadow
> +stack. Because the shadow stack stores only return addresses, a large
> +shadow stack covers the condition that both the program stack and the
> +signal alternate stack run out.

What does "by default" mean here? Is there a case when the signal handler
is not entered with SSP set to the handling thread'd shadow stack?

> +When a signal happens, the old pre-signal state is pushed on the stack. When
> +shadow stack is enabled, the shadow stack specific state is pushed onto the
> +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed
> +in a special format with bit 63 set. On sigreturn this old SSP token is
> +verified and restored by the kernel. The kernel will also push the normal
> +restorer address to the shadow stack to help userspace avoid a shadow stack
> +violation on the sigreturn path that goes through the restorer.

The kernel pushes on the shadow stack on signal entry so shadow stack
overflow cannot be handled. Please document this as non-recoverable
failure.

I think it can be made recoverable if signals with alternate stack run
on a different shadow stack. And the top of the thread shadow stack is
just corrupted instead of pushed in the overflow case. Then longjmp out
can be made to work (common in stack overflow handling cases), and
reliable crash report from the signal handler works (also common).

Does SSP get stored into the sigcontext struct somewhere?

> +Fork
> +----
> +
> +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required
> +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a
> +shadow access triggers a page fault with the shadow stack access bit set
> +in the page fault error code.
> +
> +When a task forks a child, its shadow stack PTEs are copied and both the
> +parent's and the child's shadow stack PTEs are cleared of the dirty bit.
> +Upon the next shadow stack access, the resulting shadow stack page fault
> +is handled by page copy/re-use.
> +
> +When a pthread child is created, the kernel allocates a new shadow stack
> +for the new thread. New shadow stack's behave like mmap() with respect to
> +ASLR behavior.

Please document the shadow stack lifetimes here:

I think thread exit unmaps shadow stack and vfork shares shadow stack
with parent so exit does not unmap.

I think the map_shadow_stack syscall should be mentioned in this
document too.

ABI for initial shadow stack entries:

If one wants to scan the shadow stack how to detect the end (e.g. fast
backtrace)? Is it useful to put an invalid value (-1) there?
(affects map_shadow_stack syscall too).

thanks.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.