Re: [RFC,PATCH 1/2] seccomp_filters: system call filtering using BPF

Will Drewry <wad@xxxxxxxxxxxx> · Thu, 12 Jan 2012 12:03:48 -0600

On Thu, Jan 12, 2012 at 11:57 AM, Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace.  Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail.  This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on.  This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below.  (Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be a really nice.  Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.
>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants.  Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true.  ptrace supervisors are subject to races and
>> always fail open.  This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?

I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.

> I'm not aware of any ptrace races if it's used properly.  I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

Exactly.  Security systems that, on failure, allow the action to
proceed can't be relied on.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D

>> With seccomp, it fails close.  What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs).  I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> It think it's a nice idea.  While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access.  I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well.  And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.

Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)

thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html