On Mon, Jul 8, 2024 at 9:17 AM Jeff Xu <jeffxu@xxxxxxxxxx> wrote: > > Hi > > On Thu, Jul 4, 2024 at 12:02 PM Mickaël Salaün <mic@xxxxxxxxxxx> wrote: > > > > These new SECBIT_SHOULD_EXEC_CHECK, SECBIT_SHOULD_EXEC_RESTRICT, and > > their *_LOCKED counterparts are designed to be set by processes setting > > up an execution environment, such as a user session, a container, or a > > security sandbox. Like seccomp filters or Landlock domains, the > > securebits are inherited across proceses. > > > > When SECBIT_SHOULD_EXEC_CHECK is set, programs interpreting code should > > check executable resources with execveat(2) + AT_CHECK (see previous > > patch). > > > > When SECBIT_SHOULD_EXEC_RESTRICT is set, a process should only allow > > execution of approved resources, if any (see SECBIT_SHOULD_EXEC_CHECK). > > > Do we need both bits ? > When CHECK is set and RESTRICT is not, the "check fail" executable > will still get executed, so CHECK is for logging ? > Does RESTRICT imply CHECK is set, e.g. What if CHECK=0 and RESTRICT = 1 ? > The intention might be "permissive mode"? if so, consider reuse existing selinux's concept, and still with 2 bits: SECBIT_SHOULD_EXEC_RESTRICT SECBIT_SHOULD_EXEC_RESTRICT_PERMISSIVE -Jeff > > For a secure environment, we might also want > > SECBIT_SHOULD_EXEC_CHECK_LOCKED and SECBIT_SHOULD_EXEC_RESTRICT_LOCKED > > to be set. For a test environment (e.g. testing on a fleet to identify > > potential issues), only the SECBIT_SHOULD_EXEC_CHECK* bits can be set to > > still be able to identify potential issues (e.g. with interpreters logs > > or LSMs audit entries). > > > > It should be noted that unlike other security bits, the > > SECBIT_SHOULD_EXEC_CHECK and SECBIT_SHOULD_EXEC_RESTRICT bits are > > dedicated to user space willing to restrict itself. Because of that, > > they only make sense in the context of a trusted environment (e.g. > > sandbox, container, user session, full system) where the process > > changing its behavior (according to these bits) and all its parent > > processes are trusted. Otherwise, any parent process could just execute > > its own malicious code (interpreting a script or not), or even enforce a > > seccomp filter to mask these bits. > > > > Such a secure environment can be achieved with an appropriate access > > control policy (e.g. mount's noexec option, file access rights, LSM > > configuration) and an enlighten ld.so checking that libraries are > > allowed for execution e.g., to protect against illegitimate use of > > LD_PRELOAD. > > > > Scripts may need some changes to deal with untrusted data (e.g. stdin, > > environment variables), but that is outside the scope of the kernel. > > > > The only restriction enforced by the kernel is the right to ptrace > > another process. Processes are denied to ptrace less restricted ones, > > unless the tracer has CAP_SYS_PTRACE. This is mainly a safeguard to > > avoid trivial privilege escalations e.g., by a debugging process being > > abused with a confused deputy attack. > > > > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > Cc: Christian Brauner <brauner@xxxxxxxxxx> > > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > > Cc: Paul Moore <paul@xxxxxxxxxxxxxx> > > Signed-off-by: Mickaël Salaün <mic@xxxxxxxxxxx> > > Link: https://lore.kernel.org/r/20240704190137.696169-3-mic@xxxxxxxxxxx > > --- > > > > New design since v18: > > https://lore.kernel.org/r/20220104155024.48023-3-mic@xxxxxxxxxxx > > --- > > include/uapi/linux/securebits.h | 56 ++++++++++++++++++++++++++++- > > security/commoncap.c | 63 ++++++++++++++++++++++++++++----- > > 2 files changed, 110 insertions(+), 9 deletions(-) > > > > diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h > > index d6d98877ff1a..3fdb0382718b 100644 > > --- a/include/uapi/linux/securebits.h > > +++ b/include/uapi/linux/securebits.h > > @@ -52,10 +52,64 @@ > > #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ > > (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) > > > > +/* > > + * When SECBIT_SHOULD_EXEC_CHECK is set, a process should check all executable > > + * files with execveat(2) + AT_CHECK. However, such check should only be > > + * performed if all to-be-executed code only comes from regular files. For > > + * instance, if a script interpreter is called with both a script snipped as > > + * argument and a regular file, the interpreter should not check any file. > > + * Doing otherwise would mislead the kernel to think that only the script file > > + * is being executed, which could for instance lead to unexpected permission > > + * change and break current use cases. > > + * > > + * This secure bit may be set by user session managers, service managers, > > + * container runtimes, sandboxer tools... Except for test environments, the > > + * related SECBIT_SHOULD_EXEC_CHECK_LOCKED bit should also be set. > > + * > > + * Ptracing another process is deny if the tracer has SECBIT_SHOULD_EXEC_CHECK > > + * but not the tracee. SECBIT_SHOULD_EXEC_CHECK_LOCKED also checked. > > + */ > > +#define SECURE_SHOULD_EXEC_CHECK 8 > > +#define SECURE_SHOULD_EXEC_CHECK_LOCKED 9 /* make bit-8 immutable */ > > + > > +#define SECBIT_SHOULD_EXEC_CHECK (issecure_mask(SECURE_SHOULD_EXEC_CHECK)) > > +#define SECBIT_SHOULD_EXEC_CHECK_LOCKED \ > > + (issecure_mask(SECURE_SHOULD_EXEC_CHECK_LOCKED)) > > + > > +/* > > + * When SECBIT_SHOULD_EXEC_RESTRICT is set, a process should only allow > > + * execution of approved files, if any (see SECBIT_SHOULD_EXEC_CHECK). For > > + * instance, script interpreters called with a script snippet as argument > > + * should always deny such execution if SECBIT_SHOULD_EXEC_RESTRICT is set. > > + * However, if a script interpreter is called with both > > + * SECBIT_SHOULD_EXEC_CHECK and SECBIT_SHOULD_EXEC_RESTRICT, they should > > + * interpret the provided script files if no unchecked code is also provided > > + * (e.g. directly as argument). > > + * > > + * This secure bit may be set by user session managers, service managers, > > + * container runtimes, sandboxer tools... Except for test environments, the > > + * related SECBIT_SHOULD_EXEC_RESTRICT_LOCKED bit should also be set. > > + * > > + * Ptracing another process is deny if the tracer has > > + * SECBIT_SHOULD_EXEC_RESTRICT but not the tracee. > > + * SECBIT_SHOULD_EXEC_RESTRICT_LOCKED is also checked. > > + */ > > +#define SECURE_SHOULD_EXEC_RESTRICT 10 > > +#define SECURE_SHOULD_EXEC_RESTRICT_LOCKED 11 /* make bit-8 immutable */ > > + > > +#define SECBIT_SHOULD_EXEC_RESTRICT (issecure_mask(SECURE_SHOULD_EXEC_RESTRICT)) > > +#define SECBIT_SHOULD_EXEC_RESTRICT_LOCKED \ > > + (issecure_mask(SECURE_SHOULD_EXEC_RESTRICT_LOCKED)) > > + > > #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ > > issecure_mask(SECURE_NO_SETUID_FIXUP) | \ > > issecure_mask(SECURE_KEEP_CAPS) | \ > > - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) > > + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \ > > + issecure_mask(SECURE_SHOULD_EXEC_CHECK) | \ > > + issecure_mask(SECURE_SHOULD_EXEC_RESTRICT)) > > #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) > > > > +#define SECURE_ALL_UNPRIVILEGED (issecure_mask(SECURE_SHOULD_EXEC_CHECK) | \ > > + issecure_mask(SECURE_SHOULD_EXEC_RESTRICT)) > > + > > #endif /* _UAPI_LINUX_SECUREBITS_H */ > > diff --git a/security/commoncap.c b/security/commoncap.c > > index 162d96b3a676..34b4493e2a69 100644 > > --- a/security/commoncap.c > > +++ b/security/commoncap.c > > @@ -117,6 +117,33 @@ int cap_settime(const struct timespec64 *ts, const struct timezone *tz) > > return 0; > > } > > > > +static bool ptrace_secbits_allowed(const struct cred *tracer, > > + const struct cred *tracee) > > +{ > > + const unsigned long tracer_secbits = SECURE_ALL_UNPRIVILEGED & > > + tracer->securebits; > > + const unsigned long tracee_secbits = SECURE_ALL_UNPRIVILEGED & > > + tracee->securebits; > > + /* Ignores locking of unset secure bits (cf. SECURE_ALL_LOCKS). */ > > + const unsigned long tracer_locked = (tracer_secbits << 1) & > > + tracer->securebits; > > + const unsigned long tracee_locked = (tracee_secbits << 1) & > > + tracee->securebits; > > + > > + /* The tracee must not have less constraints than the tracer. */ > > + if ((tracer_secbits | tracee_secbits) != tracee_secbits) > > + return false; > > + > > + /* > > + * Makes sure that the tracer's locks for restrictions are the same for > > + * the tracee. > > + */ > > + if ((tracer_locked | tracee_locked) != tracee_locked) > > + return false; > > + > > + return true; > > +} > > + > > /** > > * cap_ptrace_access_check - Determine whether the current process may access > > * another > > @@ -146,7 +173,8 @@ int cap_ptrace_access_check(struct task_struct *child, unsigned int mode) > > else > > caller_caps = &cred->cap_permitted; > > if (cred->user_ns == child_cred->user_ns && > > - cap_issubset(child_cred->cap_permitted, *caller_caps)) > > + cap_issubset(child_cred->cap_permitted, *caller_caps) && > > + ptrace_secbits_allowed(cred, child_cred)) > > goto out; > > if (ns_capable(child_cred->user_ns, CAP_SYS_PTRACE)) > > goto out; > > @@ -178,7 +206,8 @@ int cap_ptrace_traceme(struct task_struct *parent) > > cred = __task_cred(parent); > > child_cred = current_cred(); > > if (cred->user_ns == child_cred->user_ns && > > - cap_issubset(child_cred->cap_permitted, cred->cap_permitted)) > > + cap_issubset(child_cred->cap_permitted, cred->cap_permitted) && > > + ptrace_secbits_allowed(cred, child_cred)) > > goto out; > > if (has_ns_capability(parent, child_cred->user_ns, CAP_SYS_PTRACE)) > > goto out; > > @@ -1302,21 +1331,39 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, > > & (old->securebits ^ arg2)) /*[1]*/ > > || ((old->securebits & SECURE_ALL_LOCKS & ~arg2)) /*[2]*/ > > || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/ > > - || (cap_capable(current_cred(), > > - current_cred()->user_ns, > > - CAP_SETPCAP, > > - CAP_OPT_NONE) != 0) /*[4]*/ > > /* > > * [1] no changing of bits that are locked > > * [2] no unlocking of locks > > * [3] no setting of unsupported bits > > - * [4] doing anything requires privilege (go read about > > - * the "sendmail capabilities bug") > > */ > > ) > > /* cannot change a locked bit */ > > return -EPERM; > > > > + /* > > + * Doing anything requires privilege (go read about the > > + * "sendmail capabilities bug"), except for unprivileged bits. > > + * Indeed, the SECURE_ALL_UNPRIVILEGED bits are not > > + * restrictions enforced by the kernel but by user space on > > + * itself. The kernel is only in charge of protecting against > > + * privilege escalation with ptrace protections. > > + */ > > + if (cap_capable(current_cred(), current_cred()->user_ns, > > + CAP_SETPCAP, CAP_OPT_NONE) != 0) { > > + const unsigned long unpriv_and_locks = > > + SECURE_ALL_UNPRIVILEGED | > > + SECURE_ALL_UNPRIVILEGED << 1; > > + const unsigned long changed = old->securebits ^ arg2; > > + > > + /* For legacy reason, denies non-change. */ > > + if (!changed) > > + return -EPERM; > > + > > + /* Denies privileged changes. */ > > + if (changed & ~unpriv_and_locks) > > + return -EPERM; > > + } > > + > > new = prepare_creds(); > > if (!new) > > return -ENOMEM; > > -- > > 2.45.2 > >