Re: [PATCH v5 2/3] seccomp_filters: system call filtering using BPF

Will Drewry <wad@xxxxxxxxxxxx> · Tue, 31 Jan 2012 05:04:33 -0600

On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@xxxxxx> wrote:
> On Mon, January 30, 2012 23:26, Will Drewry wrote:
>> Do you think something along the lines of 2 kB is sane for a config-less change?
>
> Yes, especially if there is some way to get rid of it anyway,
> like disabling SECCOMP or some option under CONFIG_EMBEDDED.
> But it seems you need at least a hidden config option which
> depends on the stuff you need.

Disabling SECCOMP would definitely do it.

>>
>> Doing exactly that.  I've been tinkering with the best way to minimize
>> the impact to the existing BPF evaluator.  Right now, I'm adding a
>> very small number of new instructions to the sk_buff specific code
>> path, but I haven't yet benchmarked - just disasssembled.
>
> I would do all the checking in sk_chk_filter(), so you know that when
> you do run the filter, you can't hit the sk_buff paths. This doesn't
> cause any slow down for the networking path.

Ah sorry - I was referring to the intrusion of a load_pointer function
pointer.  I want to leave the current networking path as untouched as
possible.  For checking, I agree -- a quick change to sk_chk_filter or
even just a small helper function that scans for any codes in the
ancillary range will do the trick.

>> I agree. I will post the next series with a proposed integration. If
>> there is a lot of resistance, then the difference will be going from a
>> 2kB changes to a 3kB change.
>
> Let's see how it goes.
>
>>> I think you should go on a quest to make sure (almost) all archs have that file,
>>> before this patch can be merged. At least the archs that have ptrace support.
>>
>> I'm an idiot.  CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h
>>
>> So I have two choices:
>> 1. allow seccomp with filtering on these platforms by fail if an
>> argument is accessed
>> 2. return ENOSYS when a filter is attempted to be installed on
>> platforms with no tracehook support.
>
> I vote for:
>
> 3. Add tracehook support to all archs.

I don't see these #3 as mutually exclusive :)  tracehook requires:
- task_pt_regs()          in asm/processor.h or asm/ptrace.h
- arch_has_single_step()  if there is hardware single-step support
- arch_has_block_step()   if there is hardware block-step support
- asm/syscall.h           supplying asm-generic/syscall.h interface
- linux/regset.h          user_regset interfaces
- CORE_DUMP_USE_REGSET    #define'd in linux/elf.h
-TIF_SYSCALL_TRACE       calls tracehook_report_syscall_{entry,exit}
- TIF_NOTIFY_RESUME       calls tracehook_notify_resume()
- signal delivery         calls tracehook_signal_handler()

> Maybe not all archs, but at least some more. That way, every time someone
> adds something tracehook specific, more archs support it.

Well the other arch I want this on specifically for my purposes is
arm, but someone recently posted a partial asm/syscall.h for arm, but
I didn't see that one go anywhere just yet.  (I know syscall_get_nr
can be tricky on arm because it doesn't (didn't) have a way of
tracking in-syscall state.)

ref: https://lkml.org/lkml/2011/12/1/131

> syscall.h has no TRACEHOOK defines or anything though.

Nope - it is just part of what is expected.

> Only syscall_rollback() looks tricky. I have no clue what the difference
> between syscall_get_error() and syscall_get_return_value() is. But you
> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which
> should be possible for all archs.

It seems even syscall_get_nr can have some wrinkles :)

ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html

> How many archs don't support tracehook?

14 out of 26.  However, 5 of those still have asm/syscall.h

>> I think #2 is the nicest user contract, but #1 allows working filters
>> even on less hospitable arches even if they can't user arguments
>> (yet).  I'm coding it up as #2, but it is easy to switch to #1.
>
> If you don't support the arch, don't compile any code at all, and
> let prctl(2) return EINVAL. You don't want to return ENOSYS.

I was thinking of the inline prctl handler, but EINVAL makes sense.

>>> Yeah, I figured that out later on. It's quite nifty, but I find the recursion
>>> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this
>>> was avoided and done differently, but I can't think of a good alternative. I'll
>>> wait for the new version to see if I can find a way.
>>
>> Thanks - sure.  Since kref_put is just an atomic_dec_and_test followed
>> by a call to the function pointer, I wasn't too concerned.  Without
>> changing how the relationships are handled, I'm not sure how to
>> approach it differently (and still avoid races). IIRC, this isn't much
>> different to how namespaces work, they just use their own atomic
>> counters.
>
> Well, the thing is, this recursion is controlled by user space depending
> on how many filters they have installed. What is preventing them to force
> you out of stack?

Hrm true.  The easiest option is to just convert it to iterative by
not using kref_t, but I'll look more closely.

> So perhaps add at least some arbitrary filter limit to avoid this?

Definitely possible -- probably as a sysctl.  I'm not quite sure what
number makes sense yet, but I'll look at breaking the recursion first.
 Thanks!

>> I'll clarify a bit.  My original ptrace integration worked such that a
>> tracer may only intercept syscall failures if it attached prior to the
>> failing filter being installed.  I did it this way to avoid using
>> ptrace to escape system call filtering.  However, since I don't have
>> that as part of the patch series, it doesn't make sense to keep it. (I
>> tracked a tracer pid_struct in the filters.)  If it needs to come back
>> with later patchsets, then it can be tackled then!
>
> The problem of that is that filters can be shared between processes with
> different ptracers. And you have all the hassle of keeping it up to date.
>
> I think seccomp should always come first and just trust ptrace. This
> because it can deny all ptrace() calls for filtered tasks, so the only
> untrusted tasks doing ptrace() are outside of seccomp's filtering control.
> And those could do the same system calls themselves.
>
> The case where there is one task being filtered and allowed to do ptrace,
> but not some other syscall, ptracing another filtered task which isn't
> allowed to do ptrace, but allowed to do that other syscall, is quite far
> fetched I think. If you really want to handle this, then you could run
> the ptracer's filters against the tracee's post-ptrace syscall state.
> This is best done in the ptracer's context, just before continuing the
> system call. (You really want Oleg's SIKILL immediate patch then.)
>
> What about:
>
> 1) Seccomp filters can deny a syscall by killing the task.
>
> 2) Seccomp can deny a syscall and let it return an error value.
>
>   I know you're not fond of this one. It's just a performance
>   optimisation as sometimes a lot of denied but harmless syscalls
>   are called by glibc all the time, like getxattr(). Hardcoding
>   the kill policy seems wrong when it can be avoided. If this is
>   too hard then skip it, but if it's easy to add then please do.
>   I'll take a look at this with your next patch version.

It's easy on x86 harder on other arches.  I would suggest saving
changing the __secure_computing signature until after the core
functionality lands, but that's just me.

> 3) Seccomp can allow a syscall to proceed normally.
>
> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall.
>   A filter sets this by returning a specific value.
>
> 5) Ptrace always gets a syscall event when it asked for it.
>
> 6) Ptrace can set an option to honour seccomp's hint and to not get all
>   syscall events.
>
> This way all seccomp needs to do is to set some flags which ptrace can check.

I like the use of flags/options to trigger ptrace handling.  If I were
to stack rank these for pursuit after the core functionality lands,
it'd be to add #6 (and its deps) then #2.  With #6, #2 can be
simulated (by having a supervisor that changes the syscall number to
-1), but that is much less ideal than just returning SECCOMP_ERROR
instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled
up.

thanks!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html