Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

"Indan Zupancic" <indan@xxxxxx> · Thu, 23 Feb 2012 01:25:06 +0100

On Wed, February 22, 2012 20:47, Will Drewry wrote:
> On Wed, Feb 22, 2012 at 2:19 AM, Indan Zupancic <indan@xxxxxx> wrote:
>> I highly disagree with every filter having to check the mode: Filters that
>> don't check the arch on e.g. x86 are buggy, so they have to check it, even
>> if it's a 32-bit or 64-bit only system, the filters can't know that and
>> needs to check the arch at every syscall entry. All other info in the data
>> depends on the arch, because of this there isn't much code to share between
>> the two archs, so you can as well have one filter for each arch.
>>
>> Alternative approach: Tell the arch at filter install time and only run the
>> filters with the same arch as the current system call. If no filters are run,
>> deny the systemcall.
>
> This was roughly how I first implemented compat and non-compat
> support.  It causes some implicit behavior across inheritance that is
> not nice though.

Same implicit behaviour Ben mentioned or something else?

Yeah, that's a bit of a problem. It can be solved within filters, but
it's starting to get more obscure than just checking the arch for every
syscall.

>
>> Advantages:
>>
>> - Filters don't have to check the arch every syscall entry.
>
> This I like.
>
>> - Secure by default. Filters don't have to do anything arch specific to
>>  be secure, no surprises possible.
>
> This is partially true, but it is exactly why I hid compat before.
>
>> - If a new arch comes into existence, there is no chance of old filters
>>  becoming buggy and insecure. This is especially true for archs that
>>  had only one mode, but added another one later on: Old filters had no
>>  need to check the mode at all.
>
> Perhaps.  A buggy filter that works on x86-64 might be exposed on a
> new x32 ABI.  It's hard to predict how audit_arch and the syscall abi
> will develop with new platforms.

It doesn't matter, if the filter assumes there are only two archs possible
and those two archs need different treatment, then the new arch at best
will only match one of them and it depends on which arch the filter checks
for. E.g. whether it does:

if (arch == AUDIT_ARCH_I386)
	...
else /* assume x86_64 */
	...

versus

if (arch == AUDIT_ARCH_X86_64)
	...
else /* assume i386 */
	...

>> - For kernels supporting only one arch, the check can be optimised away,
>>  by not installing unsupported arch filters at all.
>
> Somewhat.  Without having a dedicated arch helper, you'd have to guess
> that arches only support one or two arches (if compat is supported),
> but I don't know if that is a safe assumption to make.

Well, if you want to optimise all checks away, then you obviously need
arch helpers. Without it, you have to install all filters, even the ones
you'll never run.

>> It's more secure, faster and simpler for the filters.
>>
>> If something like this is implemented it's fine to expose the arch info
>> in the syscall data too, and have a way to install filters for all archs,
>> for the few cases where that might be useful, although I can't think of
>> any reason why people would like to do unnecessary work in the filters.
>
> It seems to just add complexity to support both. I think we'll
> probably end up with it in the filters for better or worse.  Possibly
> JITing will be useful since at least a 32-bit load and je is pretty
> cheap in native instructions.

Yeah, except that you can't easily do that because you don't have direct
access to the arch.

>> All that's needed is an extra argument to the prctl() call. I propose
>> 0 for the current arch, -1 for all archs and anything else to specify
>> the arch. Installing a filter for an unsupported arch could return
>> ENOEXEC.
>
> Without adding a per-arch call, there is no way to know all the
> supported arches at install time.  Current arch, at least, can be
> determined with a call to syscall_get_arch().

True.

> As is, I'm not sure it makes sense to try to reserve two extra input
> types: 0 and -1.  0 would be sane to treat as either a wildcard or
> current because it is unlikely to be used by AUDIT_ARCH_* ever since
> EM_NONE is assigned to 0.  However, I have no such insight into
> whether it will ever be possible to compose 0xffffffff as an
> AUDIT_ARCH_.

That seems impossible.

>> As far as the implementation goes, either have a list per supported arch
>> or store the arch per filter and check that before running the filter.
>
> You can't do it per arch without adding even more per-arch
> dependencies.  Keeping them annotated in the same list is the clearest
> way I've seen so far, but it comes with its own burdens.

You could have a list per installed arch, so there is no need to know all
supported archs, if you don't have the per arch helpers.

I don't see how keeping the arch in the filter itself comes with a burden,
that's what you were basically doing with the compat flag anyway.

But keeping the check within the filter is the simplest solution it seems,
so ignore my objections and just let's live with the extra hassle at the
user space side.

>> What use is the instruction pointer considering it tells nothing about
>> the call path?

My fear of exposing the IP is that people will erroneously assume that it
says anything about the call path, and hence write insecure security code
that's easily bypassed by just jumping to the right instruction address.
And if the vDSO is used, the IP will always be the same.

So what use is knowing the IP?

>> Wouldn't it make sense to allow going from mode 2 to 1?
>> After all, the filter could have blocked it if it didn't
>> want to permit it, and mode 1 is more restrictive than
>> mode 2.
>
> Nope - that might allow a downgrade that bypasses write/read
> restrictions.  E.g., a filter could only allow a read to a certain buf
> or of a certain size.  Allowing a downgrade would allow bypassing
> those filters, whether they are the most sane things or not :)

But now you enforce that decision while the filter could make that
choice itself instead. If the filter doesn't allow read() and write(),
I really doubt it would allow prctl().

>> Out of curiosity, did you measure the kernel size differences before and
>> after these patches? Would be sad if sharing it with the networking code
>> didn't reduce the actual kernel size.
>
> Oh yeah - it was a serious reduction.  Initially, seccomp_filter.o
> added 8kb by itself.  With the merged seccomp.o, continued code
> trimming (as suggested), and all the SECCOMP_RET_* variations, the
> total kernel growth is 2972 bytes for the same kernel config.  This is
> shared across ~2000 bytes in seccomp.o and ~800 bytes in filter.o.

Looks good, though the 800 extra bytes for filter.o seems high, it
used to be 292 bytes according to your email from January the 30th.
You said the run filter function added 861 bytes, so sharing doesn't
seem to reduce the kernel size any more?

Greetings,

Indan

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html