On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote: > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > > +/* When no bits are set for a syscall, filters are run. */ > > +struct seccomp_bitmaps { > > +#ifdef SECCOMP_ARCH > > + /* "allow" are initialized to set and only ever get cleared. */ > > + DECLARE_BITMAP(allow, NR_syscalls); > > This bitmap makes sense. > > > + /* These are initialized to clear and only ever get set. */ > > + DECLARE_BITMAP(kill_thread, NR_syscalls); > > + DECLARE_BITMAP(kill_process, NR_syscalls); > > I don't think these bitmaps make sense, this is not part of any fastpath. That's a fair point. I think I arrived at this design because it ended up making filter addition faster ("don't bother processing this one, it's already 'kill'"), but it's likely not worse the memory usage trade-off. > (However, a "which syscalls have a fixed result" bitmap might make > sense if we want to export the list of permitted syscalls as a text > file in procfs, as I mentioned over at > <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@xxxxxxxxxxxxxx/>.) I haven't found a data structure I'm happy with for this. It seemed like NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET value). However, let me discuss that more in the "why in in thread?" below... > The "NR_syscalls" part assumes that the compat syscall tables will not > be bigger than the native syscall table, right? I guess that's usually > mostly true nowadays, thanks to the syscall table unification... > (might be worth a comment though) Hrm, I had convinced myself it was a max() of compat. But I see no evidence of that now. Which means that I can add these to the per-arch seccomp defines with something like: # define SECCOMP_NR_NATIVE NR_syscalls # define SECCOMP_NR_COMPAT X32_NR_syscalls ... > > +#endif > > +}; > > + > > struct seccomp_filter; > > /** > > * struct seccomp - the state of a seccomp'ed process > > @@ -45,6 +56,13 @@ struct seccomp { > > #endif > > atomic_t filter_count; > > struct seccomp_filter *filter; > > + struct seccomp_bitmaps native; > > +#ifdef CONFIG_COMPAT > > + struct seccomp_bitmaps compat; > > +#endif > > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH > > + struct seccomp_bitmaps multiplex; > > +#endif > > Why do we have one bitmap per thread (in struct seccomp) instead of > putting the bitmap for a given filter and all its ancestors into the > seccomp_filter? I explicitly didn't want to add code that was run per-filter; I wanted O(1), not O(n) even if the n work was a small constant. There is obviously a memory/perf tradeoff here. I wonder if the middle ground would be to put a bitmap and "constant action" results in the filter.... oh duh. The "top" filter is already going to be composed with its ancestors. That's all that needs to be checked. Then the tri-state can be: bitmap accept[NR_syscalls]: accept or check "known" bitmap bitmap filter[NR_syscalls]: run filter or return known action u32 known_action[NR_syscalls]; (times syscall numbering "architecture" counts) Though perhaps it would be just as fast as: bitmap run_filter[NR_syscalls]: run filter or return known_action u32 known_action[NR_syscalls]; where accept isn't treated special... > > > }; > > > > #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > > index 0a3ff8eb8aea..111a238bc532 100644 > > --- a/kernel/seccomp.c > > +++ b/kernel/seccomp.c > > @@ -318,7 +318,7 @@ static inline u8 seccomp_get_arch(u32 syscall_arch, u32 syscall_nr) > > > > #ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH > > if (syscall_arch == SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH) { > > - seccomp_arch |= (sd->nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >> > > + seccomp_arch |= (syscall_nr & SECCOMP_MULTIPLEXED_SYSCALL_TABLE_MASK) >> > > SECCOMP_MULTIPLEXED_SYSCALL_TABLE_SHIFT; > > This belongs over into patch 1. Thanks! I was rushing to get this posted so YiFei Zhu wouldn't spend time fighting with arch and Kconfig stuff. :) I'll clean this (and the other random cruft) up. -- Kees Cook _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers