On Thu, Sep 24, 2020 at 02:41:36AM +0200, Jann Horn wrote: > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > > For systems that provide multiple syscall maps based on audit > > architectures (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via > > CONFIG_COMPAT) or via syscall masks (e.g. x86_x32), allow a fast way > > to pin the process to a specific syscall table, instead of needing > > to generate all filters with an architecture check as the first filter > > action. > > > > This creates the internal representation that seccomp itself can use > > (which is separate from the filters, which need to stay runtime > > agnostic). Additionally paves the way for constant-action bitmaps. > > I don't really see the point in providing this UAPI - the syscall > number checking will probably have much more performance cost than the > architecture number check, and it's not like this lets us avoid the > check, we're just moving it over into C code. It's desirable for libseccomp and is a request from systemd (which is, at this point, the largest seccomp user I know of), as they have no way to force an arch without doing it in filters, which doesn't help much with reducing filter runtime. > > > Signed-off-by: Kees Cook <keescook@xxxxxxxxxxxx> > > --- > > include/linux/seccomp.h | 9 +++ > > include/uapi/linux/seccomp.h | 1 + > > kernel/seccomp.c | 79 ++++++++++++++++++- > > tools/testing/selftests/seccomp/seccomp_bpf.c | 33 ++++++++ > > 4 files changed, 120 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > > index 02aef2844c38..0be20bc81ea9 100644 > > --- a/include/linux/seccomp.h > > +++ b/include/linux/seccomp.h > > @@ -20,12 +20,18 @@ > > #include <linux/atomic.h> > > #include <asm/seccomp.h> > > > > +#define SECCOMP_ARCH_IS_NATIVE 1 > > +#define SECCOMP_ARCH_IS_COMPAT 2 > > FYI, mips has three different possible "arch" values (per kernel build > config; the __AUDIT_ARCH_LE flag can also be set, but that's fixed > based on the config): > > - AUDIT_ARCH_MIPS > - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT > - AUDIT_ARCH_MIPS | __AUDIT_ARCH_64BIT | __AUDIT_ARCH_CONVENTION_MIPS64_N32 > > But I guess we can deal with that once someone wants to actually add > support for this on mips. Yup! > > > +#define SECCOMP_ARCH_IS_MULTIPLEX 3 > > Why should X32 be handled specially? If the seccomp filter allows Because it's a masked lookup into a separate table; the syscalls don't map to x86_64's table; so for seccomp to correctly figure out which bitmap to use, it has to do this decoding. > specific syscalls (as it should), we don't have to care about X32. > Only in weird cases where the seccomp filter wants to deny specific > syscalls (a horrible idea), X32 is a concern, and in such cases, the > userspace code can generate a single conditional jump to deal with it. I feel like I must not understand what you mean. The x32-aware seccomp filters are using syscall tests with 0x40000000 included in the values. So seccomp's bitmap cannot handle this because it must know how many syscalls to include in a linearly-allocated bitmap. > And when seccomp is used properly to allow specific syscalls, the > kernel will just waste time uselessly checking this X32 stuff. It not measurable in my tests -- seccomp_data::nr is rather hot in the cache. ;) That said, if it's unwanted, then CONFIG_X86_X32=n is the way to go. > [...] > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > [...] > > +static long seccomp_pin_architecture(void) > > +{ > > +#ifdef SECCOMP_ARCH > > + struct task_struct *task = current; > > + > > + u8 arch = seccomp_get_arch(syscall_get_arch(task), > > + syscall_get_nr(task, task_pt_regs(task))); > > + > > + /* How did you even get here? */ > > Via a racing TSYNC, that's how. Yes; thanks. This will need to take ¤t->sighand->siglock. > > > + if (task->seccomp.arch && task->seccomp.arch != arch) > > + return -EBUSY; > > + > > + task->seccomp.arch = arch; > > +#endif > > + return 0; > > +} > > Why does this return 0 if SECCOMP_ARCH is not defined? That suggests > to userspace that we have successfully pinned the ABI, even though > we're actually unable to do so. Yup; thanks for the catch. This is a logical leftover from the RFC. This should be, I think: + task->seccomp.arch = arch; + return 0; +#else + return -EINVAL; +#endif -- Kees Cook _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers