On Thu, Sep 24, 2020 at 9:37 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote: > On Thu, Sep 24, 2020 at 02:25:03AM +0200, Jann Horn wrote: > > On Thu, Sep 24, 2020 at 1:29 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote: [...] > > (However, a "which syscalls have a fixed result" bitmap might make > > sense if we want to export the list of permitted syscalls as a text > > file in procfs, as I mentioned over at > > <https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@xxxxxxxxxxxxxx/>.) > > I haven't found a data structure I'm happy with for this. It seemed like > NR_syscalls * sizeof(u32) was rather a lot (i.e. to store the BPF_RET > value). However, let me discuss that more in the "why in in thread?" > below... [...] > > > +#endif > > > +}; > > > + > > > struct seccomp_filter; > > > /** > > > * struct seccomp - the state of a seccomp'ed process > > > @@ -45,6 +56,13 @@ struct seccomp { > > > #endif > > > atomic_t filter_count; > > > struct seccomp_filter *filter; > > > + struct seccomp_bitmaps native; > > > +#ifdef CONFIG_COMPAT > > > + struct seccomp_bitmaps compat; > > > +#endif > > > +#ifdef SECCOMP_MULTIPLEXED_SYSCALL_TABLE_ARCH > > > + struct seccomp_bitmaps multiplex; > > > +#endif > > > > Why do we have one bitmap per thread (in struct seccomp) instead of > > putting the bitmap for a given filter and all its ancestors into the > > seccomp_filter? > > I explicitly didn't want to add code that was run per-filter; I wanted > O(1), not O(n) even if the n work was a small constant. There is > obviously a memory/perf tradeoff here. I wonder if the middle ground > would be to put a bitmap and "constant action" results in the filter.... > oh duh. The "top" filter is already going to be composed with its > ancestors. That's all that needs to be checked. Yeah - when adding a new filter, you can evaluate each syscall for the newly added filter. For both the "accept" bitmap and the "constant action" bitmap, you can AND the bitmap of the existing filter into the new filter's bitmap. Although actually, I think my "constant action" bitmap proposal was a stupid idea... when someone asks for an analysis of the filter via procfs (which shouldn't be a common action, so speed doesn't really matter there), we can just dynamically evaluate the entire filter tree using our filter-evaluation helper. Let's drop the "constant action" bitmap idea. > Then the tri-state can be: > > bitmap accept[NR_syscalls]: accept or check "known" bitmap > bitmap filter[NR_syscalls]: run filter or return known action > u32 known_action[NR_syscalls]; Actually, maybe we should just have an "accept" list, nothing else, to keep it straightforward and with minimal memory usage... > (times syscall numbering "architecture" counts) > > Though perhaps it would be just as fast as: > > bitmap run_filter[NR_syscalls]: run filter or return known_action > u32 known_action[NR_syscalls]; > > where accept isn't treated special... Using a bitset for accepted syscalls instead of a big array would probably have far less cache impact on the syscall entry path. If we just have an "accept" bitmask, we can store information about 512 syscalls per cache line - that's almost the entire syscall table. In contrast, a known_action list can only store information about 16 syscalls in a cache line, and we'd additionally still have to query the "filter" bitmap. I think our goal here should be that if a syscall is always allowed, seccomp should execute the smallest amount of instructions we can get away with, and touch the smallest amount of memory possible (and preferably that memory should be shared between threads). The bitmap fastpath should probably also avoid populate_seccomp_data().