Re: [PATCH v7 01/41] Documentation/x86: Add CET shadow stack description

"Edgecombe, Rick P" <rick.p.edgecombe@xxxxxxxxx> · Wed, 1 Mar 2023 18:07:33 +0000

On Wed, 2023-03-01 at 14:21 +0000, Szabolcs Nagy wrote:
> The 02/27/2023 14:29, Rick Edgecombe wrote:
> > +Application Enabling
> > +====================
> > +
> > +An application's CET capability is marked in its ELF note and can
> > be verified
> > +from readelf/llvm-readelf output::
> > +
> > +    readelf -n <application> | grep -a SHSTK
> > +        properties: x86 feature: SHSTK
> > +
> > +The kernel does not process these applications markers directly.
> > Applications
> > +or loaders must enable CET features using the interface described
> > in section 4.
> > +Typically this would be done in dynamic loader or static runtime
> > objects, as is
> > +the case in GLIBC.
> 
> Note that this has to be an early decision in libc (ld.so or static
> exe start code), which will be difficult to hook into system wide
> security policy settings. (e.g. to force shstk on marked binaries.)

In the eager enabling (by the kernel) scenario, how is this improved?
The loader has to have the option to disable the shadow stack if
enabling conditions are not met, so it still has to trust userspace to
not do that. Did you have any more specifics on how the policy would
work?

> 
> From userspace POV I'd prefer if a static exe did not have to parse
> its own ELF notes (i.e. kernel enabled shstk based on the marking).

This is actually exactly what happens in the glibc patches. My
understand was that it already been discussed amongst glibc folks.

> But I realize if there is a need for complex shstk enable/disable
> decision that is better in userspace and if the kernel decision can
> be overridden then it might as well all be in userspace.

A complication with shadow stack in general is that it has to be
enabled very early. Otherwise when the program returns from main(), it
will get a shadow stack underflow. The old logic in this series would
enable shadow stack if the loader had the SHSTK bit (by parsing the
header in the kernel). Then later if the conditions were not met to use
shadow stack, the loader would call into the kernel again to disable
shadow stack.

One problem (there were several with this area) with this eager
enabling, was the kernel ended up mapping, briefly using, and then
unmapping the shadow stack in the case of a executable not supporting
shadow stack. What the glibc patches do today is pretty much the same
behavior as before, just with the header parsing moved into userspace.
I think letting the component with the most information make the
decision leaves open the best opportunity for making it efficient. I
wonder if it could be possible for glibc to enable it later than it
currently does in the patches and improve the dynamic loader case, but
I don't know enough of that code.

> 
> > +Enabling arch_prctl()'s
> > +=======================
> > +
> > +Elf features should be enabled by the loader using the below
> > arch_prctl's. They
> > +are only supported in 64 bit user applications.
> > +
> > +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature)
> > +    Enable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature)
> > +    Disable a single feature specified in 'feature'. Can only
> > operate on
> > +    one feature at a time.
> > +
> > +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features)
> > +    Lock in features at their current enabled or disabled status.
> > 'features'
> > +    is a mask of all features to lock. All bits set are processed,
> > unset bits
> > +    are ignored. The mask is ORed with the existing value. So any
> > feature bits
> > +    set here cannot be enabled or disabled afterwards.
> 
> The multi-thread behaviour should be documented here: Only the
> current thread is affected. So an application can only change the
> setting while single-threaded which is only guaranteed before any
> user code is executed. Later using the prctl is complicated and
> most c runtimes would not want to do that (async signalling all
> threads and prctl from the handler).

It is kind of covered in the fork() docs, but yes there should probably
be a reference here too.

> 
> In particular these interfaces are not suitable to turn shstk off
> at dlopen time when an unmarked binary is loaded. Or any other
> late shstk policy change will not work, so as far as i can see
> the "permissive" mode in glibc does not work.

Yes, that is correct. Glibc permissive mode does not fully work. There
are some ongoing discussions on how to make it work. Some options don't
require kernel changes, and some do. Making it per-thread is
complicated for x86 because when shadow stack is off, some of the
special shadow stack instructions will cause #UD exception. Glibc (any
probably other apps in the future) could be in the middle of executing
these instructions when dlopen() was called. So if there was a process
wide disable option it would have to be resilient to these #UDs. And
even then the code that used them could not be guaranteed to continue
to work. For example, if you call the gcc intrinsic _get_ssp() when
shadow stack is enabled it could be expected to point to the shadow
stack in most cases. If shadow stack gets disabled, rdssp will return
0, in which case reading the shadow stack would segfault. So the all-
process disabling solution can't be fully robust when there is any
shadow stack specific logic.

The other option discussed was creating trampolines between the
linked legacy objects that could know to tell the kernel to disable
shadow stack if needed. In this case, shadow stack is disabled for each
thread as it calls into the DSO. It's not clear if there can be enough
information gleaned from the legacy binaries to know when to generate
the trampolines in exotic cases.

A third option might be to have some synchronization between the kernel
and userspace around anything using the shadow stack instructions. But
there is not much detail filled in there.

So in summary, it's not as simple as making the disable per-process.

> 
> Does the main thread have shadow stack allocated before shstk is
> enabled?

No.

> is the shadow stack freed when it is disabled? (e.g.
> what would the instruction reading the SSP do in disabled state?)

Yes.

When shadow stack is disabled rdssp is a NOP, the intrinsic returns
NULL.

> 
> > +Proc Status
> > +===========
> > +To check if an application is actually running with shadow stack,
> > the
> > +user can read the /proc/$PID/status. It will report "wrss" or
> > "shstk"
> > +depending on what is enabled. The lines look like this::
> > +
> > +    x86_Thread_features: shstk wrss
> > +    x86_Thread_features_locked: shstk wrss
> 
> Presumaly /proc/$TID/status and /proc/$PID/task/$TID/status also
> shows the setting and only valid for the specific thread (not the
> entire process). So i would note that this for one thread only.

Since enabling/disabling is per-thread, and the field is called
"x86_Thread_features" I thought it was clear. It's easy to add some
more detail though.

> 
> > +Implementation of the Shadow Stack
> > +==================================
> > +
> > +Shadow Stack Size
> > +-----------------
> > +
> > +A task's shadow stack is allocated from memory to a fixed size of
> > +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is
> > allocated to
> > +the maximum size of the normal stack, but capped to 4 GB. However,
> > +a compat-mode application's address space is smaller, each of its
> > thread's
> > +shadow stack size is MIN(1/4 RLIMIT_STACK, 4 GB).
> 
> This policy tries to handle all threads with the same shadow stack
> size logic, which has limitations. I think it should be improved
> (otherwise some applications will have to turn shstk off):
> 
> - RLIMIT_STACK is not an upper bound for the main thread stack size
>   (rlimit can increase/decrease dynamically).
> - RLIMIT_STACK only applies to the main thread, so it is not an upper
>   bound for non-main thread stacks.
> - i.e. stack size >> startup RLIMIT_STACK is possible and then shadow
>   stack can overflow.
> - stack size << startup RLIMIT_STACK is also possible and then VA
>   space is wasted (can lead to OOM with strict memory overcommit).
> - clone3 tells the kernel the thread stack size so that should be
>   used instead of RLIMIT_STACK. (clone does not though.)

This actually happens already. I can update the docs.

> - I think it's better to have a new limit specifically for shadow
>   stack size (which by default can be RLIMIT_STACK) so userspace
>   can adjust it if needed (another reason is that stack size is
>   not always a good indicator of max call depth).

Hmm, yea. This seems like a good idea, but I don't see why it can't be
a follow on. The series is quite big just to get the basics. I have
tried to save some of the enhancements (like alt shadow stack) for the
future.

> 
> > +Signal
> > +------
> > +
> > +By default, the main program and its signal handlers use the same
> > shadow
> > +stack. Because the shadow stack stores only return addresses, a
> > large
> > +shadow stack covers the condition that both the program stack and
> > the
> > +signal alternate stack run out.
> 
> What does "by default" mean here? Is there a case when the signal
> handler
> is not entered with SSP set to the handling thread'd shadow stack?

Ah, yea, that could be updated. It is in reference to an alt shadow
stack implementation that was held for later.

> 
> > +When a signal happens, the old pre-signal state is pushed on the
> > stack. When
> > +shadow stack is enabled, the shadow stack specific state is pushed
> > onto the
> > +shadow stack. Today this is only the old SSP (shadow stack
> > pointer), pushed
> > +in a special format with bit 63 set. On sigreturn this old SSP
> > token is
> > +verified and restored by the kernel. The kernel will also push the
> > normal
> > +restorer address to the shadow stack to help userspace avoid a
> > shadow stack
> > +violation on the sigreturn path that goes through the restorer.
> 
> The kernel pushes on the shadow stack on signal entry so shadow stack
> overflow cannot be handled. Please document this as non-recoverable
> failure.

It doesn't hurt to call it out. Please see the below link for future
plans to handle this scenario (alt shadow stack).

> 
> I think it can be made recoverable if signals with alternate stack
> run
> on a different shadow stack. And the top of the thread shadow stack
> is
> just corrupted instead of pushed in the overflow case. Then longjmp
> out
> can be made to work (common in stack overflow handling cases), and
> reliable crash report from the signal handler works (also common).
> 
> Does SSP get stored into the sigcontext struct somewhere?

No, it's pushed to the shadow stack only. See the v2 coverletter of the
discussion on the design and reasoning:

https://lore.kernel.org/lkml/20220929222936.14584-1-rick.p.edgecombe@xxxxxxxxx/

> 
> > +Fork
> > +----
> > +
> > +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are
> > required
> > +to be read-only and dirty. When a shadow stack PTE is not RO and
> > dirty, a
> > +shadow access triggers a page fault with the shadow stack access
> > bit set
> > +in the page fault error code.
> > +
> > +When a task forks a child, its shadow stack PTEs are copied and
> > both the
> > +parent's and the child's shadow stack PTEs are cleared of the
> > dirty bit.
> > +Upon the next shadow stack access, the resulting shadow stack page
> > fault
> > +is handled by page copy/re-use.
> > +
> > +When a pthread child is created, the kernel allocates a new shadow
> > stack
> > +for the new thread. New shadow stack's behave like mmap() with
> > respect to
> > +ASLR behavior.
> 
> Please document the shadow stack lifetimes here:
> 
> I think thread exit unmaps shadow stack and vfork shares shadow stack
> with parent so exit does not unmap.

Sure, this can be updated.

> 
> I think the map_shadow_stack syscall should be mentioned in this
> document too.

There is a man page prepared for this. I plan to update the docs to
reference it when it exists and not duplicate the text. There can be a
blurb for the time being but it would be short lived.

> If one wants to scan the shadow stack how to detect the end (e.g.
> fast
> backtrace)? Is it useful to put an invalid value (-1) there?
> (affects map_shadow_stack syscall too).

Interesting idea. I think it's probably not a breaking ABI change if we
wanted to add it later.