(added more CRIU people)

On Sun, Jan 30, 2022 at 01:18:03PM -0800, Rick Edgecombe wrote:
> Hi,
>
> This is a slight reboot of the userspace CET series. I will be taking over the
> series from Yu-cheng. Per some internal recommendations, I’ve reset the version
> number and am calling it a new series. Hopefully, it doesn’t cause confusion.
>
> The new plan is to upstream only userspace Shadow Stack support at this point.
> IBT can follow later, but for now I’ll focus solely on the most in-demand and
> widely available (with the feature on AMD CPUs now) part of CET.
>
> I thought as part of this reset, it might be useful to more fully write-up the
> design and summarize the history of the previous CET series. So this slightly
> long cover letter does that. The "Updates" section has the changes, if anyone
> doesn't want the history.
>
>
> Why is Shadow Stack Wanted
> ==========================
> The main use case for userspace shadow stack is providing protection against
> return oriented programming attacks. Fedora and Ubuntu already have many/most
> packages enabled for shadow stack. The main missing piece is Linux kernel
> support and there seems to be a high amount of interest in the ecosystem for
> getting this feature supported. Besides security, Google has also done some
> work on using shadow stack to improve performance and reliability of tracing.
>
>
> Userspace Shadow Stack Implementation
> =====================================
> Shadow stack works by maintaining a secondary (shadow) stack that cannot be
> directly modified by applications. When executing a CALL instruction, the
> processor pushes the return address to both the normal stack and to the special
> permissioned shadow stack. Upon ret, the processor pops the shadow stack copy
> and compares it to the normal stack copy. If the two differ, the processor
> raises a control protection fault.
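
For anyone on the CC list who is new to the mechanism, the CALL/RET behaviour described above can be sketched as a toy model. This is only an illustration in Python, not kernel or hardware code; the names ToyCpu and ControlProtectionFault are made up for the example, and real hardware also enforces that ordinary stores cannot write the shadow stack, which the model only states in a comment.

```python
class ControlProtectionFault(Exception):
    """Stands in for the control protection fault raised on a mismatch."""


class ToyCpu:
    """Hypothetical model: one normal stack and one shadow stack."""

    def __init__(self):
        self.stack = []         # normal stack, writable by the application
        self.shadow_stack = []  # shadow stack, only touched by call()/ret() here

    def call(self, return_address):
        # CALL pushes the return address to both stacks.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # RET pops both copies and compares them.
        normal = self.stack.pop()
        shadow = self.shadow_stack.pop()
        if normal != shadow:
            raise ControlProtectionFault(
                f"return to {normal:#x}, shadow stack says {shadow:#x}")
        return normal


def demo():
    cpu = ToyCpu()
    cpu.call(0x401000)
    assert cpu.ret() == 0x401000  # untouched return address: RET succeeds

    cpu.call(0x401000)
    cpu.stack[-1] = 0xdeadbeef    # ROP-style overwrite of the normal stack
    try:
        cpu.ret()
    except ControlProtectionFault:
        return "fault"
    return "no fault"
```

Calling demo() walks through one clean return and one return after the normal-stack copy was overwritten; the second is the case the feature is meant to catch.
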
> This implementation supports shadow stack on 64 bit kernels only, with support
> for 32 bit only via IA32 emulation.
>
> Shadow Stack Memory
> -------------------
> The majority of this series deals with changes for handling the special
> shadow stack memory permissions. This memory is specified by the
> Dirty+RO PTE bits. A tricky aspect of this is that this combination was
> previously used to specify COW memory. So Linux needs to handle COW
> differently when shadow stack is in use. The solution is to use a
> software PTE bit to denote COW memory, and take care to clear the dirty
> bit when setting the memory RO.
>
> Setup and Upkeep of HW Registers
> --------------------------------
> Using userspace CET requires a CR4 bit set, and also the manipulation
> of two xsave managed MSRs. The kernel needs to modify these registers
> during various operations like clone and signal handling. These
> operations may happen when the registers are restored to the CPU, or
> saved in an xsave buffer. Since the recent AMX triggered FPU overhaul
> removed direct access to the xsave buffer, this series adds an
> interface to operate on the supervisor xstate.
>
> New ABIs
> --------
> This series introduces some new ABIs. The primary one is the shadow
> stack itself. Since it is readable and the shadow stack pointer is
> exposed to user space, applications can easily read and process the
> shadow stack. And in fact the tracing usages plan to do exactly that.
>
> Most of the shadow stack contents are written by HW, but some of the
> entries are added by the kernel. The main place for this is signals. As
> part of handling the signal the kernel does some manual adjustment of
> the shadow stack that userspace depends on.
>
> In addition to the contents of the shadow stack there is also user
> visible behavior around when new shadow stacks are created and set in
> the shadow stack pointer (SSP) register.
> This is relatively straightforward: shadow stacks are created when new
> stacks are created (thread creation, fork, etc.). It is more or less what
> is required to keep apps working.
>
> For situations when userspace creates a new stack (e.g. makecontext(),
> fibers, etc.), a new syscall is provided for creating shadow stack
> memory. To make the shadow stack usable, it needs to have a restore
> token written to the protected memory. So the syscall provides a way to
> specify that this should be done by the kernel.
>
> When a shadow stack violation happens (when the return address on the
> stack does not match the return address on the shadow stack), a segfault
> is generated with a new si_code specific to CET violations.
>
> Lastly, a new arch_prctl interface is created for controlling the
> enablement of CET-like features. It is intended to also be used for
> LAM. It operates on the feature status per-thread, so for process wide
> enabling it is intended to be used early in things like dynamic
> linker/loaders. However, it can be used later for per-thread enablement
> of features like WRSS.
>
> WRSS
> ----
> WRSS is an instruction that can write to shadow stacks. The HW provides
> a way to enable this instruction for userspace use. Since shadow
> stacks are created initially protected, enabling WRSS allows any apps
> that want to do unusual things with their stacks to have a way to
> weaken protection and make things more flexible. A new feature bit is
> defined to control enabling/disabling of WRSS.
>
>
> History
> =======
> The branding “CET” really consists of two features: “Shadow Stack” and
> “Indirect Branch Tracking”. They both restrict previously allowed, but rarely
> valid behaviors and require userspace to change to avoid these behaviors before
> enabling the protection. These raw HW features need to be assembled into a
> software solution across userspace and kernel in order to add security value.
> The kernel part of this solution has evolved iteratively starting with a lengthy
> RFC period.
>
> Until now, the enabling effort was trying to support both Shadow Stack and IBT.
> This history will focus on a few areas of the shadow stack development history
> that I thought stood out.
>
> Signals
> -------
> Originally signals placed the location of the shadow stack restore
> token inside the saved state on the stack. This was problematic from a
> past ABI promises perspective. So the restore location was instead just
> assumed from the shadow stack pointer. This works because in normal
> allowed cases of calling sigreturn, the shadow stack pointer should be
> right at the restore token at that time. There is no alternate shadow
> stack support. If an alt shadow stack is added later we would need to
> find a place to store the regular shadow stack token location. Options
> could be to push something on the alt shadow stack, or to keep
> something on the kernel side. So the current design keeps things simple
> while slightly kicking the can down the road if alt shadow stacks
> become a thing later. Siglongjmp is handled in glibc, using the incssp
> instruction to unwind the shadow stack over the token.
>
> Shadow Stack Allocation
> -----------------------
> makecontext() implementations need a way to create new shadow stacks
> with restore tokens such that they can be pivoted to from userspace.
> The first interface to do this was an arch_prctl(). It created a shadow
> stack with a restore token pre-setup, since the kernel has an
> instruction that can write to user shadow stacks. However, this
> interface was abandoned for being strange.
>
> The next version created PROT_SHADOW_STACK. This interface had two
> problems. First, it left no options but for userspace to create writable
> memory, write a restore token, then mprotect() it PROT_SHADOW_STACK.
> The writable window left the shadow stack exposed, weakening the
> security.
> Second, it caused problems with the guard pages. Since the
> memory was initially created writable it did not have a guard page, but
> then was mprotected later to a type of memory that should have one.
> This resulted in missing guard pages and confused rb_subtree_gap’s.
>
> This version introduces a new syscall that behaves similarly to the
> initial arch_prctl() interface in that it has the kernel write the
> restore token.
>
> Enabling Interface
> ------------------
> For the entire history of the original CET series, the design was to
> enable shadow stack automatically if the feature bit was detected in
> the elf header. Then it was userspace’s responsibility to turn it off
> via an arch_prctl() if it was not desired, and this was handled by the
> glibc dynamic loader. Glibc’s standard behavior (when CET is configured)
> is to leave shadow stack enabled if the executable and all linked
> libraries are marked with shadow stacks.
>
> Many distros (Fedora and others) have binaries already marked with
> shadow stack, waiting for kernel support. Unfortunately their glibc
> binaries expect the original arch_prctl() interface for allocating
> shadow stacks, as those changes were pushed ahead of kernel support.
> The net result of it all is, when updating to a kernel with shadow
> stack these binaries would suddenly get shadow stack enabled and expect
> the arch_prctl() interface to be there. And so calls to makecontext()
> will fail, resulting in visible breakages. This series deals with this
> problem as described below in "Updates".
>
>
> Updates
> =======
> These updates were mostly driven by public comments, but a lot of the design
> elements are new. I would like some extra scrutiny on the updates.
>
> New syscall for Shadow Stack Allocation
> ---------------------------------------
> A new syscall is added for allocating shadow stacks to replace
> PROT_SHADOW_STACK.
> Several options were considered, as described in the
> “x86/cet/shstk: Introduce map_shadow_stack syscall” patch.
>
> Xsave Managed Supervisor State Modifications
> --------------------------------------------
> The shadow stack feature requires the kernel to modify xsaves managed
> state. On one of the last versions of Yu-cheng’s series Boris had
> commented that the pattern it used to do this was not necessarily
> ideal. The pattern was to force a restore to the registers and always
> do the modification there. Then Thomas did an overhaul of the fpu code,
> part of which consisted of making raw access to the xsave buffer
> private to the fpu code. So this series tries to expose access again,
> and in a way that addresses Boris’ comments.
>
> The method is to provide functions like wrmsrl/rdmsrl, but that can
> direct the operation to the correct location (registers or buffer),
> while giving the proper notice to the fpu subsystem so things don’t get
> clobbered or corrupted.
>
> In the past a solution like this was discussed as part of the PASID
> series, and Thomas was not in favor. In CET’s case there is more
> logic around the CET MSRs than in PASID's, and wrapping this logic
> minimizes near identical open coded logic needed to do this more
> efficiently. In addition it resolves the above described problem of
> having no access to the xsave buffer. So it is being put forward here
> under the supposition that CET’s usage may lead to a different
> conclusion, not to try to ignore past direction.
>
> The user interrupt series has similar needs as CET, and will also use
> this internal interface if it’s found acceptable.
>
> Support for WRSS
> ----------------
> Andy Lutomirski had asked that, if we change the shadow stack allocation
> API such that userspace cannot create arbitrary shadow stacks, we look
> at exposing an interface to enable the WRSS instruction for userspace.
> This way apps that want to do unexpected things with shadow stacks
> would still have the option to create shadow stacks with arbitrary
> data.
>
> Switch Enabling Interface
> -------------------------
> As described above there is a problem with userspace binaries waiting
> to break as soon as the kernel supports CET. This needs to be prevented
> by changing the interface such that the old binaries will not enable
> shadow stack AND behave as if shadow stack is not enabled. They should
> run normally without shadow stack protection. Creating a new feature
> (SHSTK2) for shadow stack was explored. SHSTK would never be supported
> by the kernel, and all the userspace build tools would be updated to
> target SHSTK2 instead of SHSTK. So old SHSTK binaries would be cleanly
> disabled.
>
> But there are existing downsides to automatic elf header processing
> based enabling. The elf header feature spec is not defined by the
> kernel and there are proposals to expand it to describe additional
> logic. A simpler interface where the kernel is simply told what to
> enable, and leaves all the decision making to userspace, is more
> flexible for userspace and simpler for the kernel. There also already
> needs to be an ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and
> likely LAM will use it too), so it avoids there being two ways to turn
> on these types of features. The only tricky part for shadow stack is
> that it has to be enabled very early. Wherever the shadow stack is
> enabled, the app cannot return from that point, otherwise there will be
> a shadow stack violation. It turns out glibc can enable shadow stack
> this early, so it works nicely. So not automatically enabling any
> features in the elf header will cleanly disable all old binaries, which
> expect the kernel to enable CET features automatically. Then after the
> kernel changes are upstream, glibc can be updated to use the new
> interface. This is the solution implemented in this series.
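
The "cannot return from that point" constraint is easy to see in a toy model. The sketch below is Python, purely illustrative; ToyThread, enable_shadow_stack() and ShadowStackViolation are hypothetical names and not the kernel or glibc interface. It assumes that enabling gives the thread a fresh, empty shadow stack, so frames that were already on the normal stack have no shadow copies.

```python
class ShadowStackViolation(Exception):
    """Stands in for the fault delivered on a shadow stack mismatch."""


class ToyThread:
    """Hypothetical model of a thread that may enable shadow stack late."""

    def __init__(self):
        self.stack = []
        self.shadow_stack = []
        self.shstk_enabled = False

    def enable_shadow_stack(self):
        # Assumption: enabling starts with an empty shadow stack. Return
        # addresses pushed before this point were never recorded on it.
        self.shadow_stack = []
        self.shstk_enabled = True

    def call(self, return_address):
        self.stack.append(return_address)
        if self.shstk_enabled:
            self.shadow_stack.append(return_address)

    def ret(self):
        address = self.stack.pop()
        if self.shstk_enabled:
            if not self.shadow_stack or self.shadow_stack.pop() != address:
                raise ShadowStackViolation(hex(address))
        return address


def returns_past_enable_point():
    t = ToyThread()
    t.call(0x1000)           # frame created before enabling
    t.enable_shadow_stack()  # enabled from inside that frame
    t.call(0x2000)
    t.ret()                  # fine: this frame was pushed after enabling
    try:
        t.ret()              # returns past the enable point
    except ShadowStackViolation:
        return True
    return False


def legacy_binary_runs_unprotected():
    t = ToyThread()          # never asks for shadow stack
    t.call(0x1000)
    t.stack[-1] = 0x6666     # overwritten return address goes unnoticed
    return t.ret() == 0x6666
```

The second helper models an old binary under the new interface: nothing enables the feature, so it runs as before, just without protection.
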
>
> Expand Commit Logs
> ------------------
> As part of spinning up on this series, I found some of the commit logs
> did not describe the changes in enough detail for me to understand their
> purpose. I tried to expand the logs and comments, where I had to go
> digging. Hopefully it’s useful.
>
> Limit to only Intel Processors
> ------------------------------
> Shadow stack is supported on some AMD processors, but this revision
> (with expanded HW usage and xsaves changes) has only been tested on
> Intel ones. So this series has a patch to limit shadow stack support to
> Intel processors. Ideally the patch would not even make it to mainline,
> and should be dropped as soon as testing on AMD is done. It's included
> just in case.
>
>
> Future Work
> ===========
> Even though this is now exclusively a shadow stack series, there is still some
> remaining shadow stack work to be done.
>
> Ptrace
> ------
> Early in the series, there was a patch to allow IA32_U_CET and
> IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
> up to basic support, and it remains the plan. It will be needed for
> in-progress gdb support.
>
> CRIU Support
> ------------
> In the past there was some speculation on the mailing list about
> whether CRIU would need to be taught about CET. It turns out, it does.
> The first issue hit is that CRIU calls sigreturn directly from its
> “parasite code” that it injects into the dumper process. This violates
> this shadow stack implementation’s protection that intends to prevent
> attackers from doing this.
>
> With so many packages already enabled with shadow stack, there is
> probably desire to make it work seamlessly. But in the meantime if
> distros want to support shadow stack and CRIU, users could manually
> disable shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for
> a process they want to dump. It’s not ideal.
>
> I’d like to hear what people think about having shadow stack in the
> kernel without this resolved. Nothing would change for any users until
> they enable shadow stack in the kernel and update to a glibc configured
> with CET. Should CRIU userspace be solved before kernel support?
>
> Selftests
> ---------
> There are some CET selftests being worked on and they are not included
> here.
>
> Thanks,
>
> Rick
>
> Rick Edgecombe (7):
>   x86/mm: Prevent VM_WRITE shadow stacks
>   x86/fpu: Add helpers for modifying supervisor xstate
>   x86/fpu: Add unsafe xsave buffer helpers
>   x86/cet/shstk: Introduce map_shadow_stack syscall
>   selftests/x86: Add map_shadow_stack syscall test
>   x86/cet/shstk: Support wrss for userspace
>   x86/cpufeatures: Limit shadow stack to Intel CPUs
>
> Yu-cheng Yu (28):
>   Documentation/x86: Add CET description
>   x86/cet/shstk: Add Kconfig option for Shadow Stack
>   x86/cpufeatures: Add CET CPU feature flags for Control-flow
>     Enforcement Technology (CET)
>   x86/cpufeatures: Introduce CPU setup and option parsing for CET
>   x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
>   x86/cet: Add control-protection fault handler
>   x86/mm: Remove _PAGE_DIRTY from kernel RO pages
>   x86/mm: Move pmd_write(), pud_write() up in the file
>   x86/mm: Introduce _PAGE_COW
>   drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
>   x86/mm: Update pte_modify for _PAGE_COW
>   x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
>     transition from _PAGE_DIRTY to _PAGE_COW
>   mm: Move VM_UFFD_MINOR_BIT from 37 to 38
>   mm: Introduce VM_SHADOW_STACK for shadow stack memory
>   x86/mm: Check Shadow Stack page fault errors
>   x86/mm: Update maybe_mkwrite() for shadow stack
>   mm: Fixup places that call pte_mkwrite() directly
>   mm: Add guard pages around a shadow stack.
>   mm/mmap: Add shadow stack pages to memory accounting
>   mm: Update can_follow_write_pte() for shadow stack
>   mm/mprotect: Exclude shadow stack from preserve_write
>   mm: Re-introduce vm_flags to do_mmap()
>   x86/cet/shstk: Add user-mode shadow stack support
>   x86/process: Change copy_thread() argument 'arg' to 'stack_size'
>   x86/cet/shstk: Handle thread shadow stack
>   x86/cet/shstk: Introduce shadow stack token setup/verify routines
>   x86/cet/shstk: Handle signals for shadow stack
>   x86/cet/shstk: Add arch_prctl elf feature functions
>
>  .../admin-guide/kernel-parameters.txt | 4 +
>  Documentation/filesystems/proc.rst | 1 +
>  Documentation/x86/cet.rst | 145 ++++++
>  Documentation/x86/index.rst | 1 +
>  arch/arm/kernel/signal.c | 2 +-
>  arch/arm64/kernel/signal.c | 2 +-
>  arch/arm64/kernel/signal32.c | 2 +-
>  arch/sparc/kernel/signal32.c | 2 +-
>  arch/sparc/kernel/signal_64.c | 2 +-
>  arch/x86/Kconfig | 22 +
>  arch/x86/Kconfig.assembler | 5 +
>  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  arch/x86/ia32/ia32_signal.c | 25 +-
>  arch/x86/include/asm/cet.h | 54 +++
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  arch/x86/include/asm/disabled-features.h | 8 +-
>  arch/x86/include/asm/fpu/api.h | 8 +
>  arch/x86/include/asm/fpu/types.h | 23 +-
>  arch/x86/include/asm/fpu/xstate.h | 6 +-
>  arch/x86/include/asm/idtentry.h | 4 +
>  arch/x86/include/asm/mman.h | 24 +
>  arch/x86/include/asm/mmu_context.h | 2 +
>  arch/x86/include/asm/msr-index.h | 20 +
>  arch/x86/include/asm/page_types.h | 7 +
>  arch/x86/include/asm/pgtable.h | 302 ++++++++++--
>  arch/x86/include/asm/pgtable_types.h | 48 +-
>  arch/x86/include/asm/processor.h | 6 +
>  arch/x86/include/asm/special_insns.h | 30 ++
>  arch/x86/include/asm/trap_pf.h | 2 +
>  arch/x86/include/uapi/asm/mman.h | 8 +-
>  arch/x86/include/uapi/asm/prctl.h | 10 +
>  arch/x86/include/uapi/asm/processor-flags.h | 2 +
>  arch/x86/kernel/Makefile | 1 +
>  arch/x86/kernel/cpu/common.c | 20 +
>  arch/x86/kernel/cpu/cpuid-deps.c | 1 +
>  arch/x86/kernel/elf_feature_prctl.c | 72 +++
>  arch/x86/kernel/fpu/xstate.c | 167 ++++++-
>  arch/x86/kernel/idt.c | 4 +
>  arch/x86/kernel/process.c | 17 +-
>  arch/x86/kernel/process_64.c | 2 +
>  arch/x86/kernel/shstk.c | 446 ++++++++++++++++++
>  arch/x86/kernel/signal.c | 13 +
>  arch/x86/kernel/signal_compat.c | 2 +-
>  arch/x86/kernel/traps.c | 62 +++
>  arch/x86/mm/fault.c | 19 +
>  arch/x86/mm/mmap.c | 48 ++
>  arch/x86/mm/pat/set_memory.c | 2 +-
>  arch/x86/mm/pgtable.c | 25 +
>  drivers/gpu/drm/i915/gvt/gtt.c | 2 +-
>  fs/aio.c | 2 +-
>  fs/proc/task_mmu.c | 3 +
>  include/linux/mm.h | 19 +-
>  include/linux/pgtable.h | 8 +
>  include/linux/syscalls.h | 1 +
>  include/uapi/asm-generic/siginfo.h | 3 +-
>  include/uapi/asm-generic/unistd.h | 2 +-
>  ipc/shm.c | 2 +-
>  kernel/sys_ni.c | 1 +
>  mm/gup.c | 16 +-
>  mm/huge_memory.c | 27 +-
>  mm/memory.c | 5 +-
>  mm/migrate.c | 3 +-
>  mm/mmap.c | 15 +-
>  mm/mprotect.c | 9 +-
>  mm/nommu.c | 4 +-
>  mm/util.c | 2 +-
>  tools/testing/selftests/x86/Makefile | 9 +-
>  .../selftests/x86/test_map_shadow_stack.c | 75 +++
>  69 files changed, 1797 insertions(+), 92 deletions(-)
>  create mode 100644 Documentation/x86/cet.rst
>  create mode 100644 arch/x86/include/asm/cet.h
>  create mode 100644 arch/x86/include/asm/mman.h
>  create mode 100644 arch/x86/kernel/elf_feature_prctl.c
>  create mode 100644 arch/x86/kernel/shstk.c
>  create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c
>
>
> base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
> --
> 2.17.1

--
Sincerely yours,
Mike.