Re: [PATCH v6 6/9] kernel: entry: Support Syscall User Dispatch for common syscall entry

Andy Lutomirski <luto@xxxxxxxxxx> · Mon, 7 Sep 2020 13:20:23 -0700

On Mon, Sep 7, 2020 at 7:25 AM Christian Brauner
<christian.brauner@xxxxxxxxxx> wrote:
>
> On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> > >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> > >> use case is emulation (it can be invoked with a different ABI) such that
> > >> seccomp filtering by syscall number doesn't make sense in the first
> > >> place.  In addition, either the syscall is dispatched back to userspace,
> > >> in which case there is no resource for seccomp to protect, or the
> > >
> > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
> > > that this is still at least a clever way to circumvent a seccomp
> > > sandbox.
> > > If I'd be confined by a seccomp profile that would cause me to be
> > > SIGKILLed when I try to do open() I could prctl() myself to do user
> > > dispatch to prevent that from happening, no?
> > >
> >
> > Not really, I think. The idea is that you didn’t actually do open().
> > You did a SYSCALL instruction which meant something else, and the
> > syscall dispatch correctly prevented the kernel from misinterpreting
> > it as open().
>
> Right, for the case where you're e.g. emulating windows syscalls that's
> true. I was thinking when you're running natively on Linux: couldn't I
> first load a seccomp profile "kill me if someone does an open()", then
> I exec() the target binary and that binary is set up to do
> prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I
> had time to read it this is an all-or-nothing mechanism, i.e. _all_
> system calls are re-routed in contrast to e.g. seccomp where I could do
> this per-syscall. So for user-dispatch it wouldn't make sense to use it
> on Linux per se. Still makes me a little uneasy. :)

There's an escape hatch, so processes using this can still make syscalls.

Maybe think about it another way: a process using user dispatch should
definitely *not* trigger seccomp user notifiers, errno returns, or
ptrace events, since they'll all do the wrong thing.  IMO RET_KILL is
the same.

Barring some very severe defect, there's no way a program can use user
dispatch to escape seccomp.  A program could use user dispatch to let
itself do:

mov $__NR_open, %rax
syscall

without dying despite the presence of a filter that would kill the
process if it tried to do open(), but this doesn't bypass the filter
at all.  The process could just as easily have done:

mov $__NR_open, %rax
jmp magic_stub(%rip)

without tripping the filter, since no system call actually happens here.
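
To make the ordering argument concrete, here is a toy userspace model of
the entry path, not kernel code.  Every name in it (task_model,
syscall_entry, the outcome enum, the MODEL_NR_* constants) is made up for
illustration, and it deliberately ignores the allowed address range that
the real mechanism also has.  It assumes that user dispatch is checked
first and that seccomp then sees whatever actually reaches the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
#define MODEL_NR_READ 0L /* x86-64 value of __NR_read */
#define MODEL_NR_OPEN 2L /* x86-64 value of __NR_open */

enum outcome {
    DISPATCHED_TO_USERSPACE, /* handed back to the process, kernel runs nothing */
    KILLED_BY_SECCOMP,       /* filter matched a syscall that reached the kernel */
    EXECUTED_BY_KERNEL       /* syscall ran normally */
};

struct task_model {
    bool user_dispatch_on; /* process asked for user dispatch */
    bool selector_blocks;  /* false means the escape hatch is open */
    long seccomp_kill_nr;  /* number the filter kills on, -1 for none */
};

/* Assumed ordering: user dispatch first, seccomp second. */
enum outcome syscall_entry(const struct task_model *t, long nr)
{
    assert(t != NULL);

    /* A dispatched SYSCALL instruction never becomes a kernel syscall,
     * so there is nothing for the filter to inspect or protect. */
    if (t->user_dispatch_on && t->selector_blocks)
        return DISPATCHED_TO_USERSPACE;

    /* Anything let through the escape hatch is filtered as usual. */
    if (nr == t->seccomp_kill_nr)
        return KILLED_BY_SECCOMP;

    return EXECUTED_BY_KERNEL;
}
```

In this model, a task with a kill-on-open filter that issues
MODEL_NR_OPEN while dispatch is blocking gets DISPATCHED_TO_USERSPACE,
and the same call with the escape hatch open gets KILLED_BY_SECCOMP.  In
neither case does an open() run behind the filter's back.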

--Andy