Re: [RFC v1 00/17] seccomp-object: From attack surface reduction to sandboxing

Mickaël Salaün <mic@xxxxxxxxxxx> · Wed, 20 Apr 2016 20:21:11 +0200

Hi,

Does anyone had time to review some patches?

What do you think about the ToCToU workarounds?
What about the userland API?

The series can be found here: https://github.com/l0kod/linux/commits/seccomp-object-v1

 Mickaël

On 24/03/2016 02:46, Mickaël Salaün wrote:
> Hi,
> 
> This series is a proof of concept (not ready for production) to extend seccomp
> with the ability to check argument pointers of syscalls as kernel object (e.g.
> file path). This add a needed feature to create a full sandbox managed by
> userland like the Seatbelt/XNU Sandbox or the OpenBSD Pledge. It was initially
> inspired from a partial seccomp-LSM prototype [1] but has evolved a lot since :)
> 
> The audience for this RFC is limited to security-related actors to discuss
> about this new feature before enlarging the scope to a wider audience. This
> aims to focus on the security goal, usability and architecture before entering
> into the gory details of each subsystem. I also wish to get constructive
> criticisms about the userland API and intrusiveness of the code (and what could
> be the other ways to do it better) before going further (and addressing the
> TODO and FIXME in the code).
> 
> The approach taken is to add the minimum amount of code while still allowing
> the userland to create access rules via seccomp. The current limitation of
> seccomp is to get raw syscall arguments value but there is no way to
> dereference a pointer to check its content (e.g. the first argument of the open
> syscall). This seccomp evolution brings a generic way to check against argument
> pointer regardless from the syscall unlike current LSMs.
> 
> Here is the use case scenario:
> * First, a process must load some groups of seccomp checkers. This checkers are
>   dedicated structs describing a pointed data (e.g. path). They are
>   semantically grouped to be efficiently managed and checked in batch. Each
>   group have a static ID. This IDs are unique and they reference groups only
>   accessible from the filters created by the same process.
> * The loaded checkers are inherited and accessible by the newly created
>   filters. This groups can be referenced by filters with a new return value
>   SECCOMP_RET_ARGEVAL. Value in  SECCOMP_RET_DATA contains a group ID and an
>   argument bitmask. This return value is only meaningful between stacked
>   filters to ask a check and get the result in the extended struct
>   seccomp_data. The new fields are "is_valid_syscall", "arg_group" containing a
>   group ID and "matches[6]" consisting of one 64-bits mask per argument. This
>   bitmasks are useful to get the check result of each checker from a group on a
>   syscall argument which is handy to create a custom access control engine from
>   userland.
> * SECCOMP_RET_ARGEVAL is equivalent to SECCOMP_RET_ACCESS except that the
>   following filters can take a decision regarding a match (e.g. return EACCESS
>   or emulate the syscall).
> 
> Each checker is autonomous and new ones can easily be added in the future.
> There is currently two checkers for path objects:
> * SECCOMP_CHECK_FS_LITERAL checks if a string match a defined path;
> * SECCOMP_CHECK_FS_BENEATH checks if the path representation of a string is
>   equal or equivalent to a file belonging to a defined path.
> 
> This design does not seems too intrusive but is flexible enough to allow a
> powerful sandbox mechanism accessible by any process on Linux. The use of
> seccomp, including this new feature, is more suitable with the help of a
> userland library (e.g. libseccomp) that could help to specify a high-level
> language to express a security policy instead of raw syscall rules.
> 
> The main concern should be about time-of-check-time-of-use (TOCTOU) race
> conditions attacks. Because of the nature of seccomp (executed before the
> effective syscall and before a potential ptrace), it is not possible to block
> all races but to detect them.
> 
> There is still some questions I couldn't answer for sure (grep for FIXME or
> XXX). Comments appreciated.
> 
> Tested on the x86 and UM architectures in 32 and 64 bits (with audit enabled).
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git/log/?h=seccomp/lsm
> 
> 
> # Need for LSM
> 
> Because the arguments can be checked before the syscall actually evaluate them,
> there is two race condition classes:
> * The data pointed by the user address is in control of the userland (e.g. a
>   tracing process) and is so subject to TOCTOU race conditions between the
>   seccomp filter evaluation and the effective resource grabbing (part of each
>   syscall code).
> * The semantic of the pointed data is also subject to race condition because
>   there is no lock on the resource (e.g. file) between the evaluation of the
>   argument by the seccomp filter and the use of the pointed resource by each
>   part of the syscall code.
> 
> The solution to fix these race conditions is to copy the userspace data and to
> lock the pointed resource. Whereas it is easy to copy the userspace data, it is
> not realistic to lock any pointed resources because of obvious locking issues.
> However, it is possible to detect a TOCTOU race condition with the help of LSM
> hooks. This way, we can keep a flexible access control (e.g. by controlling
> syscall return values) while blocking unattended malicious or bogus userland
> behavior (e.g. exploit a race-condition).
> 
> To be able to deny access to a malicious userland behavior we must replay the
> seccomp filters and verify the intermediate return values to find out if the
> filters policy is still respected. Thanks to a cache we can detect if a check
> replay is necessary. Otherwise, the LSM hooks are really quick for
> non-malicious userland.
> 
> # Cache handling
> 
> Each time a checker is called, for each argument to check, it get them from
> it's seccomp_argeval_checked cache if any, or create a new cache entry and put
> it otherwise. This cache entries will be used to evaluate arguments.
> 
> When rechecking in the LSM hooks, first it find out which argument is mapped to
> the hook check and find if it differ from the corresponding cache entry. If it
> match, then return OK without replaying the checks, or if nothing match, replay
> all the checks from this check type.
> 
> # How to use it
> 
> The SECCOMP_ARGFLAG_* help to narrow the rules constraints:
> * SECCOMP_ARGFLAG_FS_DENTRY: Check and rely on the path name.
> * SECCOMP_ARGFLAG_FS_INODE: Check the data "container" whatever it's path name.
> * SECCOMP_ARGFLAG_FS_DEVICE: Check the device (i.e. file system) on which the
>   file is, e.g. it can be use to allow access to USB mass-storage or dm-verity
>   content only
> * SECCOMP_ARGFLAG_FS_MOUNT: Check the file mount point, e.g. can enforce a
>   read-only bind mount (but is less flexible than the other checks)
> * SECCOMP_ARGFLAG_FS_NOFOLLOW: Check the file without following it if it is a
>   symlink. Useful for rename(2) or open(2) with O_NOFOLLOW to have consistent
>   check. However, LSM hooks will deny all unattended accesses set by the rules
>   ignoring this flag (i.e. it act as a fail-safe).
> 
> # Limitations
> 
> ## Ptrace
> If a process can ptrace another one, the tracer can execute whatever syscall it
> wants without being constrained by any seccomp filter from the tracee. This
> apply for this seccomp extension as well. Any seccomp filter should then deny
> the use of ptrace.
> 
> The LSM hooks must ensure that the filters results are the same (with the same
> arguments) but must not deny any ptraced modifications (e.g. syscall argument
> change).
> 
> ## Stateless access
> Unlike current LSMs, the policies are stateless. It's not possible to mark and
> track a kernel object (e.g. file descriptor). Capsicum seems more appropriate
> for this kind of feature.
> 
> ## Resource usage
> We must limit the resources taken by a filter list, and so the number of rules,
> to not allow any process to exhaust the system.
> 
> 
> Regards,
> 
> Mickaël Salaün (17):
>   um: Export the sys_call_table
>   seccomp: Fix typo
>   selftest/seccomp: Fix the flag name SECCOMP_FILTER_FLAG_TSYNC
>   selftest/seccomp: Fix the seccomp(2) signature
>   security/seccomp: Add LSM and create arrays of syscall metadata
>   seccomp: Add the SECCOMP_ADD_CHECKER_GROUP command
>   seccomp: Add seccomp object checker evaluation
>   selftest/seccomp: Remove unknown_ret_is_kill_above_allow test
>   selftest/seccomp: Extend seccomp_data until matches[6]
>   selftest/seccomp: Add field_is_valid_syscall test
>   selftest/seccomp: Add argeval_open_whitelist test
>   audit,seccomp: Extend audit with seccomp state
>   selftest/seccomp: Rename TRACE_poke to TRACE_poke_sys_read
>   selftest/seccomp: Make tracer_poke() more generic
>   selftest/seccomp: Add argeval_toctou_argument test
>   security/seccomp: Protect against filesystem TOCTOU
>   selftest/seccomp: Add argeval_toctou_filesystem test
> 
>  arch/x86/um/asm/syscall.h                     |   2 +
>  include/asm-generic/vmlinux.lds.h             |  22 +
>  include/linux/audit.h                         |  25 ++
>  include/linux/compat.h                        |  10 +
>  include/linux/lsm_hooks.h                     |   5 +
>  include/linux/seccomp.h                       | 136 +++++-
>  include/linux/syscalls.h                      |  68 +++
>  include/uapi/linux/seccomp.h                  | 105 +++++
>  kernel/audit.h                                |   3 +
>  kernel/auditsc.c                              |  36 +-
>  kernel/fork.c                                 |  13 +-
>  kernel/seccomp.c                              | 594 +++++++++++++++++++++++++-
>  security/Kconfig                              |   1 +
>  security/Makefile                             |   2 +
>  security/seccomp/Kconfig                      |  14 +
>  security/seccomp/Makefile                     |   3 +
>  security/seccomp/checker_fs.c                 | 524 +++++++++++++++++++++++
>  security/seccomp/checker_fs.h                 |  18 +
>  security/seccomp/lsm.c                        | 135 ++++++
>  security/seccomp/lsm.h                        |  19 +
>  security/security.c                           |   1 +
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 572 +++++++++++++++++++++++--
>  22 files changed, 2248 insertions(+), 60 deletions(-)
>  create mode 100644 security/seccomp/Kconfig
>  create mode 100644 security/seccomp/Makefile
>  create mode 100644 security/seccomp/checker_fs.c
>  create mode 100644 security/seccomp/checker_fs.h
>  create mode 100644 security/seccomp/lsm.c
>  create mode 100644 security/seccomp/lsm.h
> 

Attachment:
signature.asc

Description: OpenPGP digital signature