On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: > On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@xxxxxxxxxxxx> wrote: >> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote: >>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote: >>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@xxxxxxxxxxxx> wrote: >> [...] >>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used >>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event, >>> >> which was awful because userspace cannot distinguish syscall-enter-stop >>> >> from syscall-exit-stop and therefore relies on the kernel that >>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.). >>> >> >>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop >>> >> events to be suppressed, but now the syscall number is lost. >>> > >>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp >>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you >>> > think here? >>> >>> I still don't quite see how this change caused this. >> >> I have a test for this at >> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c >> >>> I can play with >>> it a bit more. But RET_ERRNO *has* to be some kind of skip event, >>> because it needs to skip the syscall. >>> >>> We could change this by treating RET_ERRNO as an instruction to enter >>> phase 2 and then asking for a skip in phase 2 without changing >>> orig_ax, but IMO this is pretty ugly. >>> >>> I think this all kind of sucks. We're trying to run ptrace after >>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp. >>> That means that if we use RET_TRAP, then ptrace will see the >>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO >>> correctly given the current design) showing syscall -1, and if we use >>> RET_KILL, then ptrace just sees the process mysteriously die. >> >> Userspace is usually not prepared to see syscall -1. >> For example, strace had to be patched, otherwise it just skipped such >> syscalls as "not a syscall" events or did other improper things: >> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486 >> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891 >> > > The x32 thing is a legit ABI bug, I'd argue. I'd be happy to submit a > patch to fix that (clear the x32 bit if we're not x32). > >> A slightly different but related story: userspace is also not prepared >> to handle large errno values produced by seccomp filters like this: >> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA) >> >> For example, glibc assumes that syscalls do not return errno values greater than 0xfff: >> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55 >> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20 To save others the link reading: "Linus said he will make sure the no syscall returns a value in -1 .. -4095 as a valid result so we can savely test with -4095." Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a full int, though digging around I find this in include/linux/err.h: /* * Kernel pointers have redundant information, so we can use a * scheme where we can return either an error code or a normal * pointer with the same return value. * * This should be a per-architecture thing, to allow different * error and pointer decisions. */ #define MAX_ERRNO 4095 #ifndef __ASSEMBLY__ #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO) But no architecture overrides this. >> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask >> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff. I'm not opposed to this. I would want to more explicitly document the 4095 max value in man pages, though. > I think this is solidly in the "don't do that" category. Seccomp lets > you tamper with syscalls. If you tamper wrong, then you lose. > > Kees, what do you think about reversing the whole thing so that > ptracers always see the original syscall? What do you mean by "reversing"? The interactions I see here are: PTRACE_SYSCALL SECCOMP_RET_ERRNO SECCOMP_RET_TRACE SECCOMP_RET_TRAP Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only ptrace will trigger via _TIF_WORK_SYSCALL_EXIT. For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier: arch/x86/kernel/entry_32.S ... syscall_trace_entry: movl $-ENOSYS,PT_EAX(%esp) movl %esp, %eax call syscall_trace_enter /* What it returned is what we'll actually use. */ cmpl $(NR_syscalls), %eax jnae syscall_call jmp syscall_exit END(syscall_trace_entry) Both before and after the 2-phase change, syscall_trace_enter would return -1 if it hit SECCOMP_RET_ERRNO, before calling tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit tracehook_report_syscall_exit during syscall_trace_leave, which means a ptracer will see a syscall-exit-stop without a matching syscall-enter-stop. Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally crazy, as the ptracer would need to be the same program, and if it chose to skip a syscall, it would be in the same place: it would see PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a syscall-exit-stop. I think we can ignore this pathological case. Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip, which produces the same "only syscall-exit-stop seen" problem. In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't change (and isn't executed). In the SECCOMP_RET_TRACE, the syscall nr _could_ change, but the ptracer would be doing it, so the crazy situation around PTRACE_SYSCALL is probably safe to ignore (as long as we document what is expected to happen). So, the question is: should PTRACE_SYSCALL see a syscall that is _not_ being executed (due to seccomp)? Audit doesn't see it currently, and neither does ptrace. I would argue that it should continue to not see the syscall. That said, if it shouldn't be shown, we also shouldn't trigger syscall-exit-stop. If you can convince me it should see syscall-enter-stop, then I have two questions: 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I think we probably must, since it can already interfere via syscall-exit-stop and change the errno. And especially since a ptracer can change syscalls during syscall-enter-stop to any syscall it wants, bypassing seccomp. This condition is already documented. 2) What do we do with audit? Suddenly we have ptrace seeing a syscall that audit doesn't? And an unrelated thought: 3) Can't we find some way to fix the inability of a ptracer to distinguish between syscall-enter-stop and syscall-exit-stop? -Kees -- Kees Cook Chrome OS Security