On Fri, Feb 6, 2015 at 11:23 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote: > On Thu, Feb 5, 2015 at 6:38 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote: >> On Thu, Feb 5, 2015 at 6:32 PM, Dmitry V. Levin <ldv@xxxxxxxxxxxx> wrote: >>> On Thu, Feb 05, 2015 at 04:09:06PM -0800, Andy Lutomirski wrote: >>>> On Thu, Feb 5, 2015 at 3:49 PM, Kees Cook <keescook@xxxxxxxxxxxx> wrote: >>>> > On Thu, Feb 5, 2015 at 3:39 PM, Dmitry V. Levin <ldv@xxxxxxxxxxxx> wrote: >>> [...] >>>> >> There is a clear difference: before these changes, SECCOMP_RET_ERRNO used >>>> >> to keep the syscall number unchanged and suppress syscall-exit-stop event, >>>> >> which was awful because userspace cannot distinguish syscall-enter-stop >>>> >> from syscall-exit-stop and therefore relies on the kernel that >>>> >> syscall-enter-stop is followed by syscall-exit-stop (or tracee's death, etc.). >>>> >> >>>> >> After these changes, SECCOMP_RET_ERRNO no longer causes syscall-exit-stop >>>> >> events to be suppressed, but now the syscall number is lost. >>>> > >>>> > Ah-ha! Okay, thanks, I understand now. I think this means seccomp >>>> > phase1 should not treat RET_ERRNO as a "skip" event. Andy, what do you >>>> > think here? >>>> >>>> I still don't quite see how this change caused this. >>> >>> I have a test for this at >>> http://sourceforge.net/p/strace/code/ci/HEAD/~/tree/test/seccomp.c >>> >>>> I can play with >>>> it a bit more. But RET_ERRNO *has* to be some kind of skip event, >>>> because it needs to skip the syscall. >>>> >>>> We could change this by treating RET_ERRNO as an instruction to enter >>>> phase 2 and then asking for a skip in phase 2 without changing >>>> orig_ax, but IMO this is pretty ugly. >>>> >>>> I think this all kind of sucks. We're trying to run ptrace after >>>> seccomp, so ptrace is seeing the syscalls as transformed by seccomp. >>>> That means that if we use RET_TRAP, then ptrace will see the >>>> possibly-modified syscall, if we use RET_ERRNO, then ptrace is (IMO >>>> correctly given the current design) showing syscall -1, and if we use >>>> RET_KILL, then ptrace just sees the process mysteriously die. >>> >>> Userspace is usually not prepared to see syscall -1. >>> For example, strace had to be patched, otherwise it just skipped such >>> syscalls as "not a syscall" events or did other improper things: >>> http://sourceforge.net/p/strace/code/ci/c3948327717c29b10b5e00a436dc138b4ab1a486 >>> http://sourceforge.net/p/strace/code/ci/8e398b6c4020fb2d33a5b3e40271ebf63199b891 >>> >> >> The x32 thing is a legit ABI bug, I'd argue. I'd be happy to submit a >> patch to fix that (clear the x32 bit if we're not x32). >> >>> A slightly different but related story: userspace is also not prepared >>> to handle large errno values produced by seccomp filters like this: >>> BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | SECCOMP_RET_DATA) >>> >>> For example, glibc assumes that syscalls do not return errno values greater than 0xfff: >>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/sysdep.h#l55 >>> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86_64/syscall.S#l20 > > To save others the link reading: "Linus said he will make sure the no > syscall returns a value in -1 .. -4095 as a valid result so we can > savely test with -4095." > > Strictly speaking (ISO C, "man 3 errno"), errno is supposed to be a > full int, though digging around I find this in include/linux/err.h: > > /* > * Kernel pointers have redundant information, so we can use a > * scheme where we can return either an error code or a normal > * pointer with the same return value. > * > * This should be a per-architecture thing, to allow different > * error and pointer decisions. > */ > #define MAX_ERRNO 4095 > > #ifndef __ASSEMBLY__ > > #define IS_ERR_VALUE(x) unlikely((x) >= (unsigned long)-MAX_ERRNO) > > But no architecture overrides this. > >>> If it isn't too late, I'd recommend changing SECCOMP_RET_DATA mask >>> applied in SECCOMP_RET_ERRNO case from current 0xffff to 0xfff. > > I'm not opposed to this. I would want to more explicitly document the > 4095 max value in man pages, though. > >> I think this is solidly in the "don't do that" category. Seccomp lets >> you tamper with syscalls. If you tamper wrong, then you lose. >> >> Kees, what do you think about reversing the whole thing so that >> ptracers always see the original syscall? > > What do you mean by "reversing"? The interactions I see here are: > > PTRACE_SYSCALL > SECCOMP_RET_ERRNO > SECCOMP_RET_TRACE > SECCOMP_RET_TRAP > > Both ptrace and seccomp will trigger via _TIF_WORK_SYSCALL_ENTRY. Only > ptrace will trigger via _TIF_WORK_SYSCALL_EXIT. > > For SECCOMP_RET_ERRNO to work, we must skip the syscall, as mentioned earlier: > > arch/x86/kernel/entry_32.S ... > syscall_trace_entry: > movl $-ENOSYS,PT_EAX(%esp) > movl %esp, %eax > call syscall_trace_enter > /* What it returned is what we'll actually use. */ > cmpl $(NR_syscalls), %eax > jnae syscall_call > jmp syscall_exit > END(syscall_trace_entry) > > Both before and after the 2-phase change, syscall_trace_enter would > return -1 if it hit SECCOMP_RET_ERRNO, before calling > tracehook_report_syscall_entry. On exit, if PTRACE_SYSCALL, we'd hit > tracehook_report_syscall_exit during syscall_trace_leave, which means > a ptracer will see a syscall-exit-stop without a matching > syscall-enter-stop. > > Using SECCOMP_RET_TRACE with PTRACE_SYSCALL in place seems totally > crazy, as the ptracer would need to be the same program, and if it > chose to skip a syscall, it would be in the same place: it would see > PTRACE_EVENT_SECCOMP, then no syscall-enter-stop, then a > syscall-exit-stop. I think we can ignore this pathological case. > > Using SECCOMP_RET_TRAP with PTRACE_SYSCALL also results in a skip, > which produces the same "only syscall-exit-stop seen" problem. > > In the SECCOMP_RET_ERRNO case, the syscall nr doesn't change (and > isn't executed). In the SECCOMP_RET_TRAP case, the syscall nr doesn't > change (and isn't executed). In the SECCOMP_RET_TRACE, the syscall nr > _could_ change, but the ptracer would be doing it, so the crazy > situation around PTRACE_SYSCALL is probably safe to ignore (as long as > we document what is expected to happen). > > So, the question is: should PTRACE_SYSCALL see a syscall that is _not_ > being executed (due to seccomp)? Audit doesn't see it currently, and > neither does ptrace. I would argue that it should continue to not see > the syscall. That said, if it shouldn't be shown, we also shouldn't > trigger syscall-exit-stop. If you can convince me it should see > syscall-enter-stop, then I have two questions: I think PTRACE_SYSCALL should see syscalls that are skipped due to seccomp. I think that the exit event should see the modified errno, if any, so that strace will show whatever the traced process thinks is happening. > > 1) Do we accept that a ptracer can interfere with SECCOMP_RET_ERRNO? I > think we probably must, since it can already interfere via > syscall-exit-stop and change the errno. I think this is fine. > And especially since a ptracer > can change syscalls during syscall-enter-stop to any syscall it wants, > bypassing seccomp. This condition is already documented. If a ptracer (using PTRACE_SYSCALL) were to get the entry callback before seccomp, then this oddity would go away, which might be a good thing. A ptracer could change the syscall, but seccomp would based on what the ptracer changed the syscall to. > > 2) What do we do with audit? Suddenly we have ptrace seeing a syscall > that audit doesn't? Is this a problem? I'd be amazed if program uses both ptrace and audit -- after all, audit is a global thing, and it only has one implementation (AFAIK): auditd. auditd doesn't ptrace the world. > > And an unrelated thought: > > 3) Can't we find some way to fix the inability of a ptracer to > distinguish between syscall-enter-stop and syscall-exit-stop? > Couldn't we add PTRACE_O_TRACESYSENTRY and PTRACE_O_TRACESYSEXIT along the lines of PTRACE_O_TRACESYSGOOD? --Andy