Modern Windows applications are executing system call instructions directly from the application's code without going through the WinAPI. This breaks Wine emulation, because it doesn't have a chance to intercept and emulate these syscalls before they are submitted to Linux. In addition, we cannot simply trap every system call of the application to userspace using PTRACE_SYSEMU, because performance would suffer, since our main use case is to run Windows games over Linux. Therefore, we need some in-kernel filtering to decide whether the syscall was issued by the wine code or by the windows application. The filtering cannot really be done based solely on the syscall number, because those could collide with existing Linux syscalls. Instead, our proposed solution is to trap syscalls based on the userspace memory region that triggered the syscall, as wine is responsible for the Windows code allocations and it can apply correct memory protections to those areas. Therefore, this patch reuses the seccomp infrastructure to trap system calls, but introduces a new mode to trap based on a vma attribute that describes whether the userspace memory region is allowed to execute syscalls or not. The protection is defined at mmap/mprotect time with a new protection flag PROT_NOSYSCALL. This setting only takes effect if the new SECCOMP_MODE_MEMMAP is enabled through seccomp(). It goes without saying that this is in no way a security mechanism despite being built on top of seccomp, since an evil application can always jump to a whitelisted memory region and run the syscall. This is not a concern for Wine games. Nevertheless, we reuse seccomp as a way to avoid adding a new mechanism to essentially do the same job of filtering system calls. * Why not SECCOMP_MODE_FILTER? We experimented with dynamically generating BPF filters for whitelisted memory regions and using SECCOMP_MODE_FILTER, but there are a few reasons why it isn't enough nor a good idea for our use case: 1. We cannot set the filters at program initialization time and forget about it, since there is no way of knowing which modules will be loaded, whether native and windows. Filter would need a way to be updated frequently during game execution. 2. We cannot predict which Linux libraries will issue syscalls directly. Most of the time, whitelisting libc and a few other libraries is enough, but there are no guarantees other Linux libraries won't issue syscalls directly and break the execution. Adding every linux library that is loaded also has a large performance cost due to the large resulting filter. 3. As I mentioned before, performance is critical. In our testing with just a single memory segment blacklisted/whitelisted, the minimum size of a bpf filter would be 4 instructions. In that scenario, SECCOMP_MODE_FILTER added an average overhead of 10% to the execution time of sysinfo(2) in comparison to seccomp disabled, while the impact of SECCOMP_MODE_MEMMAP was averaged around 1.5%. Indeed, points 1 and 2 could be worked around with some userspace work and improved SECCOMP_MODE_FILTER support, but at a high performance and some stability cost, to obtain the semantics we want. Still, the performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive enough that I believe it should be considered as an upstream solution. Sending as an RFC for now to get the discussion started. In particular: 1) Would this design be acceptable? 2) I'm not comfortable with using the high part of vm_flags, though I'm not sure where else it would fit. Suggestions? CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx> CC: Kees Cook <keescook@xxxxxxxxxxxx> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx> CC: Will Drewry <wad@xxxxxxxxxxxx> CC: H. Peter Anvin <hpa@xxxxxxxxx> CC: Paul Gofman <gofmanp@xxxxxxxxx> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx> --- arch/Kconfig | 21 +++++++++ arch/x86/Kconfig | 1 + include/linux/mm.h | 8 ++++ include/linux/mman.h | 4 +- include/uapi/asm-generic/mman-common.h | 2 + include/uapi/linux/seccomp.h | 2 + kernel/seccomp.c | 59 ++++++++++++++++++++++++++ mm/mprotect.c | 2 +- 8 files changed, 97 insertions(+), 2 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 786a85d4ad40..e5c10231c9c2 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,27 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config HAVE_ARCH_SECCOMP_MEMMAP + bool + help + An arch should select this symbol if it provides all of these things: + - syscall_get_arch() + - syscall_get_arguments() + - syscall_rollback() + - syscall_set_return_value() + - SIGSYS siginfo_t support + - secure_computing is called from a ptrace_event()-safe context + - secure_computing return value is checked and a return value of -1 + results in the system call being skipped immediately. + - seccomp syscall wired up + +config SECCOMP_MEMMAP + def_bool y + depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP + help + Enable tasks to trap syscalls based on the page that triggered + the syscall. + config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2d3f963fd6f1..70998b01c533 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -145,6 +145,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_MEMMAP select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/include/linux/mm.h b/include/linux/mm.h index 5a323422d783..6271fb7fd5bc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ #ifdef CONFIG_ARCH_HAS_PKEYS @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp); # define VM_GROWSUP VM_NONE #endif +#if defined(CONFIG_X86) +# define VM_NOSYSCALL VM_HIGH_ARCH_5 +#else +# define VM_NOSYSCALL VM_NONE +#endif + /* Bits set in the VMA until the stack is in its final location */ #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ) diff --git a/include/linux/mman.h b/include/linux/mman.h index 4b08e9c9c538..a5ca42eb685a 100644 --- a/include/linux/mman.h +++ b/include/linux/mman.h @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages) */ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr) { - return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0; + return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM + | PROT_NOSYSCALL)) == 0; } #define arch_validate_prot arch_validate_prot #endif @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey) return _calc_vm_trans(prot, PROT_READ, VM_READ ) | _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) | _calc_vm_trans(prot, PROT_EXEC, VM_EXEC) | + _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) | arch_calc_vm_prot_bits(prot, pkey); } diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f94f65d429be..4eafbaa3f4bc 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -13,6 +13,8 @@ #define PROT_SEM 0x8 /* page may be used for atomic ops */ /* 0x10 reserved for arch-specific use */ /* 0x20 reserved for arch-specific use */ +#define PROT_NOSYSCALL 0x40 /* page cannot trigger syscalls */ + #define PROT_NONE 0x0 /* page can not be accessed */ #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index c1735455bc53..c7d042a409e7 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -10,12 +10,14 @@ #define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */ #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */ #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */ +#define SECCOMP_MODE_MEMMAP 3 /* Lock syscalls per memory region. */ /* Valid operations for seccomp syscall. */ #define SECCOMP_SET_MODE_STRICT 0 #define SECCOMP_SET_MODE_FILTER 1 #define SECCOMP_GET_ACTION_AVAIL 2 #define SECCOMP_GET_NOTIF_SIZES 3 +#define SECCOMP_SET_MODE_MEMMAP 4 /* Valid flags for SECCOMP_SET_MODE_FILTER */ #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 55a6184f5990..ebf09b02db8d 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, } #endif +#ifdef CONFIG_SECCOMP_MEMMAP +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd) +{ + struct vm_area_struct *vma = find_vma(current->mm, + sd->instruction_pointer); + if (!vma) + BUG(); + + if (!(vma->vm_flags & VM_NOSYSCALL)) + return 0; + + syscall_rollback(current, task_pt_regs(current)); + seccomp_send_sigsys(this_syscall, 0); + + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true); + + return -1; +} + +static long seccomp_set_mode_memmap(unsigned int flags) +{ + const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP; + long ret = 0; + + if (flags & SECCOMP_FILTER_FLAG_TSYNC) + return -EINVAL; + + spin_lock_irq(¤t->sighand->siglock); + + if (seccomp_may_assign_mode(seccomp_mode)) + seccomp_assign_mode(current, seccomp_mode, flags); + else + ret = -EINVAL; + + spin_unlock_irq(¤t->sighand->siglock); + + return ret; +} +#else +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd) +{ + BUG(); +} +static long seccomp_set_mode_memmap(unsigned int flags) +{ + return -EINVAL; +} +#endif /* CONFIG_SECCOMP_MEMMAP */ + int __secure_computing(const struct seccomp_data *sd) { int mode = current->seccomp.mode; @@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd) return 0; case SECCOMP_MODE_FILTER: return __seccomp_filter(this_syscall, sd, false); + case SECCOMP_MODE_MEMMAP: + return __seccomp_memmap(this_syscall, sd); default: BUG(); } @@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags, return -EINVAL; return seccomp_get_notif_sizes(uargs); + case SECCOMP_SET_MODE_MEMMAP: + if (uargs != NULL) + return -EINVAL; + return seccomp_set_mode_memmap(flags); default: return -EINVAL; } @@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter) op = SECCOMP_SET_MODE_FILTER; uargs = filter; break; + case SECCOMP_MODE_MEMMAP: + op = SECCOMP_SET_MODE_MEMMAP; + uargs = NULL; + break; default: return -EINVAL; } diff --git a/mm/mprotect.c b/mm/mprotect.c index 494192ca954b..6b5c00e8bbdc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len, * cleared from the VMA. */ mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC | - VM_FLAGS_CLEAR; + VM_NOSYSCALL | VM_FLAGS_CLEAR; new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey); newflags = calc_vm_prot_bits(prot, new_vma_pkey); -- 2.27.0.rc2