On Wed, Apr 20, 2016 at 11:21 AM, Mickaël Salaün <mic@xxxxxxxxxxx> wrote:
> Hi,
>
> Has anyone had time to review some of the patches?

Hi!

Sorry for the delay on this. I keep getting distracted by other stuff. I've
got some time on a plane tomorrow, so I'll bring your series along and spend
some time reading through it more carefully.

-Kees

>
> What do you think about the TOCTOU workarounds?
> What about the userland API?
>
> The series can be found here: https://github.com/l0kod/linux/commits/seccomp-object-v1
>
> Mickaël
>
>
> On 24/03/2016 02:46, Mickaël Salaün wrote:
>> Hi,
>>
>> This series is a proof of concept (not ready for production) that extends
>> seccomp with the ability to check syscall argument pointers as kernel
>> objects (e.g. file paths). This adds a feature needed to create a full
>> sandbox managed by userland, like the Seatbelt/XNU Sandbox or OpenBSD
>> Pledge. It was initially inspired by a partial seccomp-LSM prototype [1]
>> but has evolved a lot since :)
>>
>> The audience for this RFC is limited to security-related actors, to discuss
>> this new feature before enlarging the scope to a wider audience. The aim is
>> to focus on the security goal, usability and architecture before entering
>> into the gory details of each subsystem. I would also like constructive
>> criticism of the userland API and the intrusiveness of the code (and of
>> other, better ways to do it) before going further (and addressing the TODO
>> and FIXME in the code).
>>
>> The approach taken is to add the minimum amount of code while still allowing
>> userland to create access rules via seccomp. The current limitation of
>> seccomp is that a filter gets the raw syscall argument values but has no way
>> to dereference a pointer to check its content (e.g. the first argument of
>> the open syscall). This seccomp evolution brings a generic way to check
>> argument pointers regardless of the syscall, unlike current LSMs.
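[Editorial illustration, not part of the original message.] The limitation
described above can be shown with a small userland toy model. This is only a
sketch under stated assumptions: raw_filter() and deref_checker() are
hypothetical names invented here, the "ALLOW"/"KILL"/"DENY" strings are
placeholders, and ctypes stands in for the user memory a real syscall argument
would point to. It is not the series' API.

```python
import ctypes


def raw_filter(args, allowed_addr):
    """Model of the existing view: a filter only sees integers.

    args[0] is the *address* of the path string, not the string itself.
    """
    return "ALLOW" if args[0] == allowed_addr else "KILL"


def deref_checker(args, allowed_path):
    """Hypothetical checker that reads the NUL-terminated string behind args[0]."""
    return "ALLOW" if ctypes.string_at(args[0]) == allowed_path else "DENY"


# A 64-byte buffer plays the role of the user memory holding the path.
buf = ctypes.create_string_buffer(b"/tmp/ok", 64)
args = (ctypes.addressof(buf), 0, 0, 0, 0, 0)

before = (raw_filter(args, args[0]), deref_checker(args, b"/tmp/ok"))

# Same address, different content: the raw view cannot tell the difference.
buf.value = b"/etc/shadow"
after = (raw_filter(args, args[0]), deref_checker(args, b"/tmp/ok"))
```

In this model the raw comparison gives the same verdict before and after the
buffer is rewritten, while the dereferencing check changes its answer. The
same rewrite is also the simplest form of the TOCTOU problem discussed later
in the cover letter.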
>>
>> Here is the use case scenario:
>> * First, a process must load some groups of seccomp checkers. These
>>   checkers are dedicated structs describing pointed-to data (e.g. a path).
>>   They are semantically grouped to be efficiently managed and checked in
>>   batch. Each group has a static ID. These IDs are unique and they reference
>>   groups only accessible from the filters created by the same process.
>> * The loaded checkers are inherited by and accessible from newly created
>>   filters. These groups can be referenced by filters with a new return
>>   value, SECCOMP_RET_ARGEVAL. The value in SECCOMP_RET_DATA contains a group
>>   ID and an argument bitmask. This return value is only meaningful between
>>   stacked filters, to request a check and get the result in the extended
>>   struct seccomp_data. The new fields are "is_valid_syscall", "arg_group"
>>   (containing a group ID) and "matches[6]" (one 64-bit mask per argument).
>>   These bitmasks give the check result of each checker from a group on a
>>   syscall argument, which is handy to create a custom access control engine
>>   in userland.
>> * SECCOMP_RET_ARGEVAL is equivalent to SECCOMP_RET_ACCESS except that the
>>   following filters can take a decision regarding a match (e.g. return
>>   EACCES or emulate the syscall).
>>
>> Each checker is autonomous and new ones can easily be added in the future.
>> There are currently two checkers for path objects:
>> * SECCOMP_CHECK_FS_LITERAL checks if a string matches a defined path;
>> * SECCOMP_CHECK_FS_BENEATH checks if the path representation of a string is
>>   equal or equivalent to a file belonging to a defined path.
>>
>> This design does not seem too intrusive but is flexible enough to allow a
>> powerful sandbox mechanism accessible by any process on Linux. Seccomp,
>> including this new feature, is best used through a userland library (e.g.
>> libseccomp) that could provide a high-level language to express a security
>> policy instead of raw syscall rules.
>>
>> The main concern should be time-of-check-time-of-use (TOCTOU) race
>> condition attacks. Because of the nature of seccomp (executed before the
>> effective syscall and before a potential ptrace), it is not possible to
>> block all races, only to detect them.
>>
>> There are still some questions I couldn't answer for sure (grep for FIXME or
>> XXX). Comments appreciated.
>>
>> Tested on the x86 and UM architectures in 32 and 64 bits (with audit enabled).
>>
>> [1] https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/lsm
>>
>>
>> # Need for LSM
>>
>> Because the arguments can be checked before the syscall actually evaluates
>> them, there are two classes of race condition:
>> * The data pointed to by the user address is under the control of userland
>>   (e.g. a tracing process) and so is subject to TOCTOU race conditions
>>   between the seccomp filter evaluation and the effective resource grabbing
>>   (part of each syscall code).
>> * The semantics of the pointed-to data are also subject to race conditions
>>   because there is no lock on the resource (e.g. file) between the
>>   evaluation of the argument by the seccomp filter and the use of the
>>   pointed-to resource by each part of the syscall code.
>>
>> The solution to these race conditions is to copy the userspace data and to
>> lock the pointed-to resource. Whereas it is easy to copy the userspace data,
>> it is not realistic to lock every pointed-to resource because of obvious
>> locking issues. However, it is possible to detect a TOCTOU race condition
>> with the help of LSM hooks. This way, we can keep a flexible access control
>> (e.g. by controlling syscall return values) while blocking unexpected
>> malicious or bogus userland behavior (e.g. exploiting a race condition).
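[Editorial illustration, not part of the original message.] The
copy-then-detect idea above can be modeled in a few lines of userland Python.
This is a sketch under loud assumptions: ArgCache, filter_check() and
hook_recheck() are hypothetical names, is_beneath() is a purely lexical
stand-in for a "beneath" path check (no symlink or mount handling), and a
bytearray plays the role of user memory. It is not the kernel code from the
series.

```python
import posixpath


def is_beneath(path, root):
    """Lexical check: is path equal to root or located under it?"""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root.rstrip("/") + "/")


class ArgCache:
    """Remembers what the filter checked so a later hook can compare."""

    def __init__(self, root):
        self.root = root
        self.checked = {}  # argument index -> snapshot of the checked string

    def filter_check(self, arg_index, user_buffer):
        # Time of check: copy the user data first, then evaluate the copy.
        snapshot = bytes(user_buffer).split(b"\0", 1)[0].decode()
        self.checked[arg_index] = snapshot
        return is_beneath(snapshot, self.root)

    def hook_recheck(self, arg_index, used_path):
        # Time of use: the fast path applies when nothing changed.
        if self.checked.get(arg_index) == used_path:
            return True
        # Mismatch: replay the check against what is really being used.
        return is_beneath(used_path, self.root)


cache = ArgCache("/sandbox")
buf = bytearray(b"/sandbox/data.txt\0")
allowed = cache.filter_check(0, buf)

# A racing thread rewrites the buffer between the check and the use.
buf[:] = b"/etc/shadow\0"
used = bytes(buf).split(b"\0", 1)[0].decode()
race_blocked = not cache.hook_recheck(0, used)
```

The point of the design is that the common, non-racing case costs one
comparison at the hook, and the full check is replayed only when the value
used differs from the value that was checked.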
>>
>> To be able to deny access to malicious userland behavior we must replay the
>> seccomp filters and verify the intermediate return values to find out if the
>> filter policy is still respected. Thanks to a cache we can detect whether a
>> check replay is necessary. Otherwise, the LSM hooks are really quick for
>> non-malicious userland.
>>
>> # Cache handling
>>
>> Each time a checker is called, for each argument to check, it gets the
>> argument from its seccomp_argeval_checked cache if present, or otherwise
>> creates a new cache entry and stores it. These cache entries will be used to
>> evaluate arguments.
>>
>> When rechecking in the LSM hooks, it first finds out which argument is
>> mapped to the hook check and whether it differs from the corresponding cache
>> entry. If it matches, the hook returns OK without replaying the checks; if
>> nothing matches, it replays all the checks of this check type.
>>
>> # How to use it
>>
>> The SECCOMP_ARGFLAG_* flags help to narrow the rule constraints:
>> * SECCOMP_ARGFLAG_FS_DENTRY: Check and rely on the path name.
>> * SECCOMP_ARGFLAG_FS_INODE: Check the data "container" whatever its path
>>   name.
>> * SECCOMP_ARGFLAG_FS_DEVICE: Check the device (i.e. file system) on which
>>   the file is, e.g. it can be used to allow access to USB mass-storage or
>>   dm-verity content only.
>> * SECCOMP_ARGFLAG_FS_MOUNT: Check the file mount point, e.g. can enforce a
>>   read-only bind mount (but is less flexible than the other checks).
>> * SECCOMP_ARGFLAG_FS_NOFOLLOW: Check the file without following it if it is
>>   a symlink. Useful for rename(2) or open(2) with O_NOFOLLOW to have a
>>   consistent check. However, LSM hooks will deny all unexpected accesses set
>>   by the rules ignoring this flag (i.e. it acts as a fail-safe).
>>
>> # Limitations
>>
>> ## Ptrace
>> If a process can ptrace another one, the tracer can execute whatever syscall
>> it wants without being constrained by any seccomp filter from the tracee.
>> This applies to this seccomp extension as well. Any seccomp filter should
>> then deny the use of ptrace.
>>
>> The LSM hooks must ensure that the filter results are the same (with the
>> same arguments) but must not deny any ptraced modifications (e.g. syscall
>> argument change).
>>
>> ## Stateless access
>> Unlike current LSMs, the policies are stateless. It's not possible to mark
>> and track a kernel object (e.g. file descriptor). Capsicum seems more
>> appropriate for this kind of feature.
>>
>> ## Resource usage
>> We must limit the resources taken by a filter list, and so the number of
>> rules, to not allow any process to exhaust the system.
>>
>>
>> Regards,
>>
>> Mickaël Salaün (17):
>>   um: Export the sys_call_table
>>   seccomp: Fix typo
>>   selftest/seccomp: Fix the flag name SECCOMP_FILTER_FLAG_TSYNC
>>   selftest/seccomp: Fix the seccomp(2) signature
>>   security/seccomp: Add LSM and create arrays of syscall metadata
>>   seccomp: Add the SECCOMP_ADD_CHECKER_GROUP command
>>   seccomp: Add seccomp object checker evaluation
>>   selftest/seccomp: Remove unknown_ret_is_kill_above_allow test
>>   selftest/seccomp: Extend seccomp_data until matches[6]
>>   selftest/seccomp: Add field_is_valid_syscall test
>>   selftest/seccomp: Add argeval_open_whitelist test
>>   audit,seccomp: Extend audit with seccomp state
>>   selftest/seccomp: Rename TRACE_poke to TRACE_poke_sys_read
>>   selftest/seccomp: Make tracer_poke() more generic
>>   selftest/seccomp: Add argeval_toctou_argument test
>>   security/seccomp: Protect against filesystem TOCTOU
>>   selftest/seccomp: Add argeval_toctou_filesystem test
>>
>>  arch/x86/um/asm/syscall.h                     |   2 +
>>  include/asm-generic/vmlinux.lds.h             |  22 +
>>  include/linux/audit.h                         |  25 ++
>>  include/linux/compat.h                        |  10 +
>>  include/linux/lsm_hooks.h                     |   5 +
>>  include/linux/seccomp.h                       | 136 +++++-
>>  include/linux/syscalls.h                      |  68 +++
>>  include/uapi/linux/seccomp.h                  | 105 +++++
>>  kernel/audit.h                                |   3 +
>>  kernel/auditsc.c                              |  36 +-
>>  kernel/fork.c                                 |  13 +-
>>  kernel/seccomp.c                              | 594 +++++++++++++++++++++++++-
>>  security/Kconfig                              |   1 +
>>  security/Makefile                             |   2 +
>>  security/seccomp/Kconfig                      |  14 +
>>  security/seccomp/Makefile                     |   3 +
>>  security/seccomp/checker_fs.c                 | 524 +++++++++++++++++++++++
>>  security/seccomp/checker_fs.h                 |  18 +
>>  security/seccomp/lsm.c                        | 135 ++++++
>>  security/seccomp/lsm.h                        |  19 +
>>  security/security.c                           |   1 +
>>  tools/testing/selftests/seccomp/seccomp_bpf.c | 572 +++++++++++++++++++++++--
>>  22 files changed, 2248 insertions(+), 60 deletions(-)
>>  create mode 100644 security/seccomp/Kconfig
>>  create mode 100644 security/seccomp/Makefile
>>  create mode 100644 security/seccomp/checker_fs.c
>>  create mode 100644 security/seccomp/checker_fs.h
>>  create mode 100644 security/seccomp/lsm.c
>>  create mode 100644 security/seccomp/lsm.h
>>
>

--
Kees Cook
Chrome OS & Brillo Security