On Fri, Apr 6, 2012 at 1:23 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote: > On Thu, 29 Mar 2012 15:01:53 -0500 > Will Drewry <wad@xxxxxxxxxxxx> wrote: > >> [This patch depends on luto@xxxxxxx's no_new_privs patch: >> https://lkml.org/lkml/2012/1/30/264 >> included in this series for ease of consumption. >> ] >> >> This patch adds support for seccomp mode 2. Mode 2 introduces the >> ability for unprivileged processes to install system call filtering >> policy expressed in terms of a Berkeley Packet Filter (BPF) program. >> This program will be evaluated in the kernel for each system call >> the task makes and computes a result based on data in the format >> of struct seccomp_data. >> ... >> +static void seccomp_filter_log_failure(int syscall) >> +{ >> + int compat = 0; >> +#ifdef CONFIG_COMPAT >> + compat = is_compat_task(); >> +#endif > > hm, I'm surprised that we don't have a zero-returning implementation of > is_compat_task() when CONFIG_COMPAT=n. Seems silly. Blames Arnd. There is, but this chunk of the patch is removed later on in the series. We could just drop seccomp_filter_log_failure and it's single call entirely and depend on the later enhancement to add the logging that is desired. >> +static long seccomp_attach_filter(struct sock_fprog *fprog) >> +{ >> + struct seccomp_filter *filter; >> + unsigned long fp_size = fprog->len * sizeof(struct sock_filter); >> + unsigned long total_insns = fprog->len; >> + long ret; >> + >> + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) >> + return -EINVAL; >> + >> + for (filter = current->seccomp.filter; filter; filter = filter->prev) >> + total_insns += filter->len + 4; /* include a 4 instr penalty */ > > So tasks don't share filters? We copy them by value at fork? Do we do > this at vfork() too? The filter chain is shared (and refcounted). >> + if (total_insns > MAX_INSNS_PER_PATH) >> + return -ENOMEM; >> + >> + /* >> + * Installing a seccomp filter requires that the task have >> + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs. >> + * This avoids scenarios where unprivileged tasks can affect the >> + * behavior of privileged children. >> + */ >> + if (!current->no_new_privs && >> + security_capable_noaudit(current_cred(), current_user_ns(), >> + CAP_SYS_ADMIN) != 0) >> + return -EACCES; >> + >> + /* Allocate a new seccomp_filter */ >> + filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL); > > I think this gives userspace an easy way of causing page allocation > failure warnings, by permitting large kmalloc() attempts. Add > __GFP_NOWARN? This is likely mitigated by the MAX_INSNS_PER_PATH check above, but, yeah, there's no harm in adding __GFP_NOWARN. >> +/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */ >> +void put_seccomp_filter(struct task_struct *tsk) >> +{ >> + struct seccomp_filter *orig = tsk->seccomp.filter; >> + /* Clean up single-reference branches iteratively. */ >> + while (orig && atomic_dec_and_test(&orig->usage)) { >> + struct seccomp_filter *freeme = orig; >> + orig = orig->prev; >> + kfree(freeme); >> + } >> +} > > So if one of the filters in the list has an elevated refcount, we bail > out on the remainder of the list. Seems odd. This so that every filter in the list doesn't need to have their refcount raised. As long as the counting up matching the counting down, it's fine. This allows for process trees branching the filter list at different times still being safe. IIUC, this code was based on how namespace refcounting is handled. I spent some time proving to myself that it was correctly refcounted a while back. More eyes is better, of course. :) -Kees -- Kees Cook ChromeOS Security -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html