On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> One problem with a kernel config setting is that it's for all tasks.
> While docker and systemd may make decisions based on syscall number,
> other applications may have more nuanced filters, and this cache would
> yield incorrect results.
>
> You could work around this by making this a filter flag instead;
> filter authors would generally know whether their filter results can
> be cached and probably be motivated to opt in if their users are
> complaining about slow syscall execution.
>
> Tycho

Yielding incorrect results should not be possible. The purpose of the
"emulator" (for lack of a better term) is to determine whether the
filter reads any syscall arguments. A read from a syscall argument
must go through the BPF_LD | BPF_ABS instruction, where the 32-bit
multiuse field "k" is an offset into struct seccomp_data.

struct seccomp_data contains four components [1]: the syscall number,
the architecture number, the instruction pointer at the time of the
syscall, and the syscall arguments. The syscall number is enumerated
by the emulator. The arch number is handled by the cache as "if the
arch number differs from the cached arch number, flush the cache"
(this is in seccomp_cache_check). The last two (ip and args) are
treated exactly the same way in this patch: if the filter loads them
at all, the syscall is marked non-cacheable for any architecture
number.

The struct seccomp_data is the only external thing the filter may
access. It is also cBPF, so it cannot use maps to store state between
runs. Therefore a seccomp filter is a pure function. If we know that,
given some inputs (the syscall number and arch number), the function
will not evaluate any other inputs before returning, then we can
safely cache the result keyed on just those inputs.
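For reference, this is struct seccomp_data as defined in the UAPI
header [1]:

struct seccomp_data {
	int nr;				/* syscall number */
	__u32 arch;			/* AUDIT_ARCH_* value */
	__u64 instruction_pointer;	/* CPU IP at time of syscall */
	__u64 args[6];			/* args, always stored as 64-bit */
};

The per-load test the emulator boils down to is something like the
following. This is a minimal sketch of the idea only, not the actual
patch code, and the function name is mine:

#include <stdbool.h>
#include <stddef.h>		/* offsetof */
#include <linux/filter.h>	/* struct sock_filter, BPF_* macros */
#include <linux/seccomp.h>	/* struct seccomp_data */

/* Would evaluating this cBPF instruction read anything from
 * struct seccomp_data other than nr or arch? */
static bool seccomp_load_is_cacheable(const struct sock_filter *insn)
{
	/* Reads of struct seccomp_data must go through BPF_LD | BPF_ABS. */
	if (BPF_CLASS(insn->code) != BPF_LD ||
	    BPF_MODE(insn->code) != BPF_ABS)
		return true;

	/* k is the byte offset into struct seccomp_data. */
	switch (insn->k) {
	case offsetof(struct seccomp_data, nr):
		return true;	/* enumerated by the emulator */
	case offsetof(struct seccomp_data, arch):
		return true;	/* covered by flush-on-arch-change */
	default:
		return false;	/* ip or args: mark non-cacheable */
	}
}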
As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check
compiles into:

	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
		return false;
   0xffffffff8120fdb3 <+99>:	movsxd rdx,DWORD PTR [r12]
   0xffffffff8120fdb7 <+103>:	cmp    edx,0x1b7
   0xffffffff8120fdbd <+109>:	ja     0xffffffff8120fdf9 <__seccomp_filter+169>

	if (unlikely(thread_data->last_filter != sfilter ||
		     thread_data->last_arch != sd->arch)) {
   0xffffffff8120fdbf <+111>:	mov    rdi,QWORD PTR [rbp-0xb8]
   0xffffffff8120fdc6 <+118>:	lea    rsi,[rax+0x6f0]
   0xffffffff8120fdcd <+125>:	cmp    rdi,QWORD PTR [rax+0x728]
   0xffffffff8120fdd4 <+132>:	jne    0xffffffff812101f0 <__seccomp_filter+1184>
   0xffffffff8120fdda <+138>:	mov    ebx,DWORD PTR [r12+0x4]
   0xffffffff8120fddf <+143>:	cmp    DWORD PTR [rax+0x730],ebx
   0xffffffff8120fde5 <+149>:	jne    0xffffffff812101f0 <__seccomp_filter+1184>

	return test_bit(syscall_nr, thread_data->syscall_ok);
   0xffffffff8120fdeb <+155>:	bt     QWORD PTR [rax+0x6f0],rdx
   0xffffffff8120fdf3 <+163>:	jb     0xffffffff8120ffb7 <__seccomp_filter+615>

[... unlikely path of cache flush omitted]

and seccomp_cache_insert compiles into:

	if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls))
		return;
   0xffffffff8121021b <+1227>:	movsxd rax,DWORD PTR [r12]
   0xffffffff8121021f <+1231>:	cmp    eax,0x1b7
   0xffffffff81210224 <+1236>:	ja     0xffffffff8120ffb7 <__seccomp_filter+615>

	if (!test_bit(syscall_nr, sfilter->cache.syscall_ok))
		return;
   0xffffffff8121022a <+1242>:	mov    rbx,QWORD PTR [rbp-0xb8]
   0xffffffff81210231 <+1249>:	mov    rdx,QWORD PTR gs:0x17000
   0xffffffff8121023a <+1258>:	bt     QWORD PTR [rbx+0x108],rax
   0xffffffff81210242 <+1266>:	jae    0xffffffff8120ffb7 <__seccomp_filter+615>

	set_bit(syscall_nr, thread_data->syscall_ok);
   0xffffffff81210248 <+1272>:	lock bts QWORD PTR [rdx+0x6f0],rax
   0xffffffff81210251 <+1281>:	jmp    0xffffffff8120ffb7 <__seccomp_filter+615>

In the case of a non-cacheable syscall happening over and over, the
code path goes through the syscall_nr bound check, the filter flush
check, and the test_bit in seccomp_cache_check, then another
syscall_nr bound check and one more test_bit in seccomp_cache_insert.
Considering that these are stack variables, members of the current
task_struct, or members of the filter struct, I imagine they would
already be warm in the CPU data cache and not incur much overhead. The
CPU is also free to branch-predict and reorder the memory accesses
(there are no hardware memory barriers here) to further increase
efficiency, whereas a normal filter execution would be impaired by
things like retpoline.

If one were to add an additional does-userspace-want-us-to-cache flag,
it would still be a member of the filter struct, so the cache lines
that are loaded now would still be loaded. Correct me if I'm wrong,
but I don't think that check would remove any significant part of the
overhead of the seccomp cache itself. That said, I have not profiled
the exact overhead this patch adds to uncacheable syscalls; I can
report back with numbers if you would like to see them (the P.S.
below sketches the kind of filters I would compare).

Does that answer your concern?

YiFei Zhu

[1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60
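P.S. To make the cacheable/non-cacheable distinction concrete, the
filters I would compare in such a benchmark look roughly like this.
This is an illustrative sketch only; the array names and the mincore
example are mine, not from the patch:

#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Cacheable: the verdict depends only on seccomp_data->nr. */
struct sock_filter nr_only[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mincore, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Non-cacheable: this one also loads args[0] (its low 32 bits), so
 * the emulator must mark mincore as not cacheable. */
struct sock_filter reads_args[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mincore, 0, 3),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, args[0])),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

Each would be wrapped in a struct sock_fprog and installed with
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) after setting
PR_SET_NO_NEW_PRIVS, and the benchmark would time a tight syscall
loop under each filter.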