Hi,

This is a slight reboot of the userspace CET series. I will be taking
over the series from Yu-cheng. Per some internal recommendations, I’ve
reset the version number and am calling it a new series. Hopefully, it
doesn’t cause confusion.

The new plan is to upstream only userspace Shadow Stack support at this
point. IBT can follow later, but for now I’ll focus solely on the most
in-demand and widely available (with the feature on AMD CPUs now) part
of CET.

I thought as part of this reset, it might be useful to more fully write
up the design and summarize the history of the previous CET series. So
this slightly long cover letter does that. The "Updates" section has the
changes, if anyone doesn't want the history.

Why is Shadow Stack Wanted
==========================
The main use case for userspace shadow stack is providing protection
against return-oriented programming attacks. Fedora and Ubuntu already
have many/most packages enabled for shadow stack. The main missing piece
is Linux kernel support, and there seems to be a high amount of interest
in the ecosystem for getting this feature supported. Besides security,
Google has also done some work on using shadow stack to improve
performance and reliability of tracing.

Userspace Shadow Stack Implementation
=====================================
Shadow stack works by maintaining a secondary (shadow) stack that cannot
be directly modified by applications. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and to
the special permissioned shadow stack. Upon RET, the processor pops the
shadow stack copy and compares it to the normal stack copy. If the two
differ, the processor raises a control protection fault. This
implementation supports shadow stack on 64 bit kernels only, with
support for 32 bit only via IA32 emulation.

Shadow Stack Memory
-------------------
The majority of this series deals with changes for handling the special
shadow stack memory permissions.
This memory is specified by the Dirty+RO PTE bits. A tricky aspect of
this is that this combination was previously used to specify COW memory.
So Linux needs to handle COW differently when shadow stack is in use.
The solution is to use a software PTE bit to denote COW memory, and take
care to clear the dirty bit when setting the memory RO.

Setup and Upkeep of HW Registers
--------------------------------
Using userspace CET requires a CR4 bit set, and also the manipulation of
two xsave managed MSRs. The kernel needs to modify these registers
during various operations like clone and signal handling. These
operations may happen when the registers are restored to the CPU, or
saved in an xsave buffer. Since the recent AMX triggered FPU overhaul
removed direct access to the xsave buffer, this series adds an interface
to operate on the supervisor xstate.

New ABIs
--------
This series introduces some new ABIs. The primary one is the shadow
stack itself. Since it is readable and the shadow stack pointer is
exposed to user space, applications can easily read and process the
shadow stack. And in fact the tracing usages plan to do exactly that.

Most of the shadow stack contents are written by HW, but some of the
entries are added by the kernel. The main place for this is signals. As
part of handling the signal the kernel does some manual adjustment of
the shadow stack that userspace depends on.

In addition to the contents of the shadow stack there is also user
visible behavior around when new shadow stacks are created and set in
the shadow stack pointer (SSP) register. This is relatively
straightforward: shadow stacks are created when new stacks are created
(thread creation, fork, etc). It is more or less what is required to
keep apps working.

For situations when userspace creates a new stack (e.g. makecontext(),
fibers, etc), a new syscall is provided for creating shadow stack
memory.
To make the shadow stack usable, it needs to have a restore token
written to the protected memory. So the syscall provides a way to
specify that this should be done by the kernel.

When a shadow stack violation happens (when the return address on the
stack does not match the return address in the shadow stack), a segfault
is generated with a new si_code specific to CET violations.

Lastly, a new arch_prctl interface is created for controlling the
enablement of CET-like features. It is intended to also be used for LAM.
It operates on the feature status per-thread, so for process wide
enabling it is intended to be used early in things like dynamic
linker/loaders. However, it can be used later for per-thread enablement
of features like WRSS.

WRSS
----
WRSS is an instruction that can write to shadow stacks. The HW provides
a way to enable this instruction for userspace use. Since shadow stacks
are created initially protected, enabling WRSS allows any apps that want
to do unusual things with their stacks to have a way to weaken
protection and make things more flexible. A new feature bit is defined
to control enabling/disabling of WRSS.

History
=======
The branding “CET” really consists of two features: “Shadow Stack” and
“Indirect Branch Tracking”. They both restrict previously allowed, but
rarely valid behaviors and require userspace to change to avoid these
behaviors before enabling the protection. These raw HW features need to
be assembled into a software solution across userspace and kernel in
order to add security value. The kernel part of this solution has
evolved iteratively starting with a lengthy RFC period.

Until now, the enabling effort was trying to support both Shadow Stack
and IBT. This history will focus on a few areas of the shadow stack
development history that I thought stood out.

Signals
-------
Originally signals placed the location of the shadow stack restore token
inside the saved state on the stack. This was problematic from a past
ABI promises perspective.
So the restore location was instead just assumed from the shadow stack
pointer. This works because in normal allowed cases of calling
sigreturn, the shadow stack pointer should be right at the restore token
at that time. There is no alternate shadow stack support. If an alt
shadow stack is added later we would need to find a place to store the
regular shadow stack token location. Options could be to push something
on the alt shadow stack, or to keep something on the kernel side. So the
current design keeps things simple while slightly kicking the can down
the road if alt shadow stacks become a thing later.

Siglongjmp is handled in glibc, using the incssp instruction to unwind
the shadow stack over the token.

Shadow Stack Allocation
-----------------------
makecontext() implementations need a way to create new shadow stacks
with restore tokens such that they can be pivoted to from userspace. The
first interface to do this was an arch_prctl(). It created a shadow
stack with a restore token pre-setup, since the kernel has an
instruction that can write to user shadow stacks. However, this
interface was abandoned for being strange.

The next version created PROT_SHADOW_STACK. This interface had two
problems. First, it left no option but for userspace to create writable
memory, write a restore token, then mprotect() it PROT_SHADOW_STACK. The
writable window left the shadow stack exposed, weakening the security.
Second, it caused problems with the guard pages. Since the memory was
initially created writable it did not have a guard page, but then was
mprotected later to a type of memory that should have one. This resulted
in missing guard pages and confused rb_subtree_gap values.

This version introduces a new syscall that behaves similarly to the
initial arch_prctl() interface in that it has the kernel write the
restore token.
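
To make the kernel-written restore token idea concrete, here is a toy
userspace model. It is only a sketch under stated assumptions and not
the syscall from this series: ordinary heap memory stands in for shadow
stack memory, every toy_* name is hypothetical, and the token encoding
(the address just above the token, with bit 0 set) is assumed for the
model rather than taken from this letter.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy model, NOT the kernel interface: plain heap memory stands in for
 * shadow stack memory and all toy_* names are hypothetical.
 */
struct toy_shstk {
	uint64_t *base;		/* lowest slot of the region */
	size_t nslots;		/* region size in 8-byte slots */
	uint64_t *ssp;		/* toy shadow stack pointer, NULL until pivoted */
};

/* Assumed encoding for this model: address just above the slot, bit 0 set. */
static uint64_t toy_token_for(const uint64_t *slot)
{
	return (uint64_t)(uintptr_t)(slot + 1) | 1ULL;
}

/*
 * Models the allocation step: the "kernel" creates the region and writes
 * the restore token into the top slot itself, so the caller never needs
 * a writable window. Returns 0 on success, -1 on failure.
 */
static int toy_map_shadow_stack(struct toy_shstk *s, size_t nslots)
{
	if (nslots == 0)
		return -1;
	s->base = calloc(nslots, sizeof(*s->base));
	if (!s->base)
		return -1;
	s->nslots = nslots;
	s->ssp = NULL;
	s->base[nslots - 1] = toy_token_for(&s->base[nslots - 1]);
	return 0;
}

/*
 * Models a pivot: succeeds only if `slot` is an aligned slot inside the
 * region and holds a token pointing just above itself. The token is
 * cleared on success so it cannot be used a second time.
 */
static int toy_pivot(struct toy_shstk *s, uint64_t *slot)
{
	uintptr_t lo = (uintptr_t)s->base;
	uintptr_t hi = (uintptr_t)(s->base + s->nslots);
	uintptr_t p = (uintptr_t)slot;

	if (p < lo || p >= hi || (p - lo) % sizeof(uint64_t) != 0)
		return -1;
	if (*slot != toy_token_for(slot))
		return -1;
	*slot = 0;
	s->ssp = slot + 1;
	return 0;
}

/* Releases the region and resets the model. */
static void toy_unmap_shadow_stack(struct toy_shstk *s)
{
	free(s->base);
	s->base = NULL;
	s->nslots = 0;
	s->ssp = NULL;
}
```

In the model, pivoting to the top slot succeeds exactly once; any other
slot, or a second attempt on the consumed token, is rejected. Because
the token is in place before the caller ever sees the region, there is
no step where the caller has to write into it, which is the property the
PROT_SHADOW_STACK approach lacked.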

Enabling Interface
------------------
For the entire history of the original CET series, the design was to
enable shadow stack automatically if the feature bit was detected in the
ELF header. Then it was userspace’s responsibility to turn it off via an
arch_prctl() if it was not desired, and this was handled by the glibc
dynamic loader. Glibc’s standard behavior (when CET is configured) is to
leave shadow stack enabled if the executable and all linked libraries
are marked with shadow stacks.

Many distros (Fedora and others) have binaries already marked with
shadow stack, waiting for kernel support. Unfortunately their glibc
binaries expect the original arch_prctl() interface for allocating
shadow stacks, as those changes were pushed ahead of kernel support. The
net result of it all is, when updating to a kernel with shadow stack
these binaries would suddenly get shadow stack enabled and expect the
arch_prctl() interface to be there. And so calls to makecontext() will
fail, resulting in visible breakages. This series deals with this
problem as described below in "Updates".

Updates
=======
These updates were mostly driven by public comments, but a lot of the
design elements are new. I would like some extra scrutiny on the
updates.

New syscall for Shadow Stack Allocation
---------------------------------------
A new syscall is added for allocating shadow stacks to replace
PROT_SHADOW_STACK. Several options were considered, as described in the
“x86/cet/shstk: Introduce map_shadow_stack syscall” patch.

Xsave Managed Supervisor State Modifications
--------------------------------------------
The shadow stack feature requires the kernel to modify xsaves managed
state. On one of the last versions of Yu-cheng’s series Boris had
commented that the pattern it was using to do this was not necessarily
ideal. The pattern was to force a restore to the registers and always do
the modification there.
Then Thomas did an overhaul of the fpu code, part of which consisted of
making raw access to the xsave buffer private to the fpu code. So this
series tries to expose access again, and in a way that addresses Boris’
comments.

The method is to provide functions like wrmsrl/rdmsrl, but that can
direct the operation to the correct location (registers or buffer),
while giving the proper notice to the fpu subsystem so things don’t get
clobbered or corrupted.

In the past a solution like this was discussed as part of the PASID
series, and Thomas was not in favor. In CET’s case there is more logic
around the CET MSRs than in PASID's, and wrapping this logic minimizes
the near identical open coded logic needed to do this more efficiently.
In addition it resolves the above described problem of having no access
to the xsave buffer. So it is being put forward here under the
supposition that CET’s usage may lead to a different conclusion, not to
try to ignore past direction.

The user interrupt series has similar needs as CET, and will also use
this internal interface if it’s found acceptable.

Support for WRSS
----------------
Andy Lutomirski had asked that, if we change the shadow stack allocation
API such that userspace cannot create arbitrary shadow stacks, we then
look at exposing an interface to enable the WRSS instruction for
userspace. This way apps that want to do unexpected things with shadow
stacks would still have the option to create shadow stacks with
arbitrary data.

Switch Enabling Interface
-------------------------
As described above there is a problem with userspace binaries waiting to
break as soon as the kernel supports CET. This needs to be prevented by
changing the interface such that the old binaries will not enable shadow
stack AND behave as if shadow stack is not enabled. They should run
normally without shadow stack protection. Creating a new feature
(SHSTK2) for shadow stack was explored.
SHSTK would never be supported by the kernel, and all the userspace
build tools would be updated to target SHSTK2 instead of SHSTK. So old
SHSTK binaries would be cleanly disabled.

But there are existing downsides to automatic ELF header processing
based enabling. The ELF header feature spec is not defined by the kernel
and there are proposals to expand it to describe additional logic. A
simpler interface where the kernel is simply told what to enable, and
leaves all the decision making to userspace, is more flexible for
userspace and simpler for the kernel. There also already needs to be an
ARCH_X86_FEATURE_ENABLE arch_prctl() for WRSS (and likely LAM will use
it too), so it avoids there being two ways to turn on these types of
features. The only tricky part for shadow stack is that it has to be
enabled very early. Wherever the shadow stack is enabled, the app cannot
return from that point, otherwise there will be a shadow stack
violation. It turns out glibc can enable shadow stack this early, so it
works nicely. So not automatically enabling any features in the ELF
header will cleanly disable all old binaries, which expect the kernel to
enable CET features automatically. Then after the kernel changes are
upstream, glibc can be updated to use the new interface. This is the
solution implemented in this series.

Expand Commit Logs
------------------
As part of spinning up on this series, I found some of the commit logs
did not describe the changes in enough detail for me to understand their
purpose. I tried to expand the logs and comments, where I had to go
digging. Hopefully it’s useful.

Limit to only Intel Processors
------------------------------
Shadow stack is supported on some AMD processors, but this revision
(with expanded HW usage and xsaves changes) has only been tested on
Intel ones. So this series has a patch to limit shadow stack support to
Intel processors.
Ideally the patch would not even make it to mainline, and should be
dropped as soon as this testing is done. It's included just in case.

Future Work
===========
Even though this is now exclusively a shadow stack series, there is
still some remaining shadow stack work to be done.

Ptrace
------
Early in the series, there was a patch to allow IA32_U_CET and
IA32_PL3_SSP to be set. This patch was dropped and planned as a follow
up to basic support, and it remains the plan. It will be needed for
in-progress gdb support.

CRIU Support
------------
In the past there was some speculation on the mailing list about whether
CRIU would need to be taught about CET. It turns out, it does. The first
issue hit is that CRIU calls sigreturn directly from its “parasite code”
that it injects into the dumper process. This violates this shadow stack
implementation’s protection that intends to prevent attackers from doing
this.

With so many packages already enabled with shadow stack, there is
probably desire to make it work seamlessly. But in the meantime if
distros want to support shadow stack and CRIU, users could manually
disable shadow stack via “GLIBC_TUNABLES=glibc.cpu.x86_shstk=off” for a
process they want to dump. It’s not ideal.

I’d like to hear what people think about having shadow stack in the
kernel without this resolved. Nothing would change for any users until
they enable shadow stack in the kernel and update to a glibc configured
with CET. Should CRIU userspace be solved before kernel support?

Selftests
---------
There are some CET selftests being worked on and they are not included
here.

Thanks,

Rick

Rick Edgecombe (7):
  x86/mm: Prevent VM_WRITE shadow stacks
  x86/fpu: Add helpers for modifying supervisor xstate
  x86/fpu: Add unsafe xsave buffer helpers
  x86/cet/shstk: Introduce map_shadow_stack syscall
  selftests/x86: Add map_shadow_stack syscall test
  x86/cet/shstk: Support wrss for userspace
  x86/cpufeatures: Limit shadow stack to Intel CPUs

Yu-cheng Yu (28):
  Documentation/x86: Add CET description
  x86/cet/shstk: Add Kconfig option for Shadow Stack
  x86/cpufeatures: Add CET CPU feature flags for Control-flow
    Enforcement Technology (CET)
  x86/cpufeatures: Introduce CPU setup and option parsing for CET
  x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states
  x86/cet: Add control-protection fault handler
  x86/mm: Remove _PAGE_DIRTY from kernel RO pages
  x86/mm: Move pmd_write(), pud_write() up in the file
  x86/mm: Introduce _PAGE_COW
  drm/i915/gvt: Change _PAGE_DIRTY to _PAGE_DIRTY_BITS
  x86/mm: Update pte_modify for _PAGE_COW
  x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for
    transition from _PAGE_DIRTY to _PAGE_COW
  mm: Move VM_UFFD_MINOR_BIT from 37 to 38
  mm: Introduce VM_SHADOW_STACK for shadow stack memory
  x86/mm: Check Shadow Stack page fault errors
  x86/mm: Update maybe_mkwrite() for shadow stack
  mm: Fixup places that call pte_mkwrite() directly
  mm: Add guard pages around a shadow stack.
  mm/mmap: Add shadow stack pages to memory accounting
  mm: Update can_follow_write_pte() for shadow stack
  mm/mprotect: Exclude shadow stack from preserve_write
  mm: Re-introduce vm_flags to do_mmap()
  x86/cet/shstk: Add user-mode shadow stack support
  x86/process: Change copy_thread() argument 'arg' to 'stack_size'
  x86/cet/shstk: Handle thread shadow stack
  x86/cet/shstk: Introduce shadow stack token setup/verify routines
  x86/cet/shstk: Handle signals for shadow stack
  x86/cet/shstk: Add arch_prctl elf feature functions

 .../admin-guide/kernel-parameters.txt       |   4 +
 Documentation/filesystems/proc.rst          |   1 +
 Documentation/x86/cet.rst                   | 145 ++++++
 Documentation/x86/index.rst                 |   1 +
 arch/arm/kernel/signal.c                    |   2 +-
 arch/arm64/kernel/signal.c                  |   2 +-
 arch/arm64/kernel/signal32.c                |   2 +-
 arch/sparc/kernel/signal32.c                |   2 +-
 arch/sparc/kernel/signal_64.c               |   2 +-
 arch/x86/Kconfig                            |  22 +
 arch/x86/Kconfig.assembler                  |   5 +
 arch/x86/entry/syscalls/syscall_32.tbl      |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 arch/x86/ia32/ia32_signal.c                 |  25 +-
 arch/x86/include/asm/cet.h                  |  54 +++
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/disabled-features.h    |   8 +-
 arch/x86/include/asm/fpu/api.h              |   8 +
 arch/x86/include/asm/fpu/types.h            |  23 +-
 arch/x86/include/asm/fpu/xstate.h           |   6 +-
 arch/x86/include/asm/idtentry.h             |   4 +
 arch/x86/include/asm/mman.h                 |  24 +
 arch/x86/include/asm/mmu_context.h          |   2 +
 arch/x86/include/asm/msr-index.h            |  20 +
 arch/x86/include/asm/page_types.h           |   7 +
 arch/x86/include/asm/pgtable.h              | 302 ++++++++++--
 arch/x86/include/asm/pgtable_types.h        |  48 +-
 arch/x86/include/asm/processor.h            |   6 +
 arch/x86/include/asm/special_insns.h        |  30 ++
 arch/x86/include/asm/trap_pf.h              |   2 +
 arch/x86/include/uapi/asm/mman.h            |   8 +-
 arch/x86/include/uapi/asm/prctl.h           |  10 +
 arch/x86/include/uapi/asm/processor-flags.h |   2 +
 arch/x86/kernel/Makefile                    |   1 +
 arch/x86/kernel/cpu/common.c                |  20 +
 arch/x86/kernel/cpu/cpuid-deps.c            |   1 +
 arch/x86/kernel/elf_feature_prctl.c         |  72 +++
 arch/x86/kernel/fpu/xstate.c                | 167 ++++++-
 arch/x86/kernel/idt.c                       |   4 +
 arch/x86/kernel/process.c                   |  17 +-
 arch/x86/kernel/process_64.c                |   2 +
 arch/x86/kernel/shstk.c                     | 446 ++++++++++++++++++
 arch/x86/kernel/signal.c                    |  13 +
 arch/x86/kernel/signal_compat.c             |   2 +-
 arch/x86/kernel/traps.c                     |  62 +++
 arch/x86/mm/fault.c                         |  19 +
 arch/x86/mm/mmap.c                          |  48 ++
 arch/x86/mm/pat/set_memory.c                |   2 +-
 arch/x86/mm/pgtable.c                       |  25 +
 drivers/gpu/drm/i915/gvt/gtt.c              |   2 +-
 fs/aio.c                                    |   2 +-
 fs/proc/task_mmu.c                          |   3 +
 include/linux/mm.h                          |  19 +-
 include/linux/pgtable.h                     |   8 +
 include/linux/syscalls.h                    |   1 +
 include/uapi/asm-generic/siginfo.h          |   3 +-
 include/uapi/asm-generic/unistd.h           |   2 +-
 ipc/shm.c                                   |   2 +-
 kernel/sys_ni.c                             |   1 +
 mm/gup.c                                    |  16 +-
 mm/huge_memory.c                            |  27 +-
 mm/memory.c                                 |   5 +-
 mm/migrate.c                                |   3 +-
 mm/mmap.c                                   |  15 +-
 mm/mprotect.c                               |   9 +-
 mm/nommu.c                                  |   4 +-
 mm/util.c                                   |   2 +-
 tools/testing/selftests/x86/Makefile        |   9 +-
 .../selftests/x86/test_map_shadow_stack.c   |  75 +++
 69 files changed, 1797 insertions(+), 92 deletions(-)
 create mode 100644 Documentation/x86/cet.rst
 create mode 100644 arch/x86/include/asm/cet.h
 create mode 100644 arch/x86/include/asm/mman.h
 create mode 100644 arch/x86/kernel/elf_feature_prctl.c
 create mode 100644 arch/x86/kernel/shstk.c
 create mode 100644 tools/testing/selftests/x86/test_map_shadow_stack.c

base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
-- 
2.17.1