Re: [RFC PATCH bpf-next seccomp 00/12] eBPF seccomp filters

Jinghao Jia <jinghao7@xxxxxxxxxxxx> · Wed, 9 Jun 2021 01:27:40 -0500

On 5/20/21 3:56 AM, Christian Brauner wrote:
On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
On 5/10/21 10:21 PM, YiFei Zhu wrote:
On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@xxxxxxxxx> wrote:
From: YiFei Zhu <yifeifz2@xxxxxxxxxxxx>

Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html

This patchset enables seccomp filters to be written in eBPF.
Supporting eBPF filters has been proposed a few times in the past.
The main concerns were (1) use cases and (2) security. We have
identified many use cases that can benefit from advanced eBPF
filters, such as:
I haven't reviewed this carefully, but I think we need to distinguish
a few things:

1. Using the eBPF *language*.

2. Allowing the use of stateful / non-pure eBPF features.

3. Allowing the eBPF programs to read the target process' memory.

I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
even less convinced by (3).

   * exec-only-once filter / apply filter after exec
This is (2).  I'm not sure it's a good idea.
The basic idea is that a container runtime may want to execute
a program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. The container runtime can block any syscall but execve in the
exec-ed process by using only cBPF.
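
To make the exec-only-once idea concrete, here is a rough Python model of
the decision logic. It is not eBPF and not code from the patch set: the
class name and the exec_seen flag are hypothetical, and the flag merely
stands in for whatever per-task state a stateful filter would keep. Only
the two return-value constants mirror <linux/seccomp.h>. Variants such as
execveat are ignored for brevity.

```python
import errno

# Action values as defined in <linux/seccomp.h>.
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000


class ExecOnceFilter:
    """Toy model of a stateful filter: allow the first execve, deny later ones."""

    def __init__(self):
        # Hypothetical stand-in for per-task state (e.g. a map entry).
        self.exec_seen = False

    def decide(self, syscall_name):
        # Every syscall other than execve passes through untouched.
        if syscall_name != "execve":
            return SECCOMP_RET_ALLOW
        # The first execve is the runtime starting the container binary.
        if not self.exec_seen:
            self.exec_seen = True
            return SECCOMP_RET_ALLOW
        # Any later execve fails with EPERM in the errno data bits.
        return SECCOMP_RET_ERRNO | errno.EPERM
```

The point of the sketch is only that the second decision depends on the
first, which is the kind of state a pure cBPF filter cannot carry.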

The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
@Andrea and @Giuseppe, could you clarify more in case I missed
something?
We've discussed having a notifier-using filter be able to replace its
filter.  This would allow this and other use cases without any
additional eBPF or cBPF code.

A notifier is not always a solution (even ignoring its perf overhead).

One problem, pointed out by Andrea Arcangeli, is that notifiers need
userspace daemons. So, it can hardly be used by daemonless container
engines like Podman.
I'm not sure I buy this argument. Podman already has a conmon instance
for each container, this could be a child of that conmon process, or
live inside conmon itself.

Tycho
I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.

You are right that Podman is not completely daemonless. However, “the
fact it's not entirely daemonless doesn't imply it's a good idea to
make it worse and to add complexity to the background conmon daemon or
to add more daemons.”

TL;DR. User notifiers are surely more flexible, but are also more
expensive and complex to implement, compared with ebpf filters. /*
I’ll reply to Sargun’s performance argument in a separate email */

I'm sure you know Podman well, but let me still borrow some points from
Andrea and Giuseppe (all credit for podman/crun is theirs) to
elaborate the point, for folks cced on the list who are not very
familiar with Podman.

Basically, the current order goes as follows:

          podman -> conmon -> crun -> container_binary
                                \
                                 - seccomp done at crun level, not conmon

At runtime, what's left is:

          conmon -> container_binary  /* podman disappears; crun disappears */

So, to go through and use seccomp notify to block `exec`, we can
either start the container_binary with a seccomp agent wrapper, or
bloat the conmon binary (as pointed out by Tycho).

If we go with the first approach, we will have:

          podman -> conmon -> crun -> seccomp_agent -> container_binary

So, at runtime we'd be left with one more daemon:

         conmon -> seccomp_agent -> container_binary
That seems like a strawman. I don't see why this has to be out of
process or a separate daemon. Conmon uses a regular event loop. Adding
support for processing notifier syscall notifications is
straightforward. Moving it to a plugin as you mentioned below is a
design decision not a necessity.

Apparently, nobody likes one more daemon. So, the proposal from
I'm not sure such a blanket statement about an indeterminate group of
people's alleged preferences constitutes a technical argument why we
need ebpf in seccomp.

Giuseppe was/is to use user notifiers as plugins (.so) loaded by
conmon:
https://github.com/containers/conmon/pull/190
https://github.com/containers/crun/pull/438

Now, with the ebpf filter support, one can implement the same thing
using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
this is well supported by crun.
So I think this is trying to jump the gun by saying "Look, the result
might be simpler.". That may even be the case - though I'm not yet
convinced - but Andy's point stands that this brings a slew of issues on
the table that need clear answers. Bringing stateful ebpf features into
seccomp is a pretty big step and especially around the
privilege/security model it looks pretty handwavy right now.
For the privilege/security model, I assume that you are referring to a 
way to safely do unprivileged ebpf and to make sure the ebpf features 
can be used by seccomp securely.

In fact, the privilege model is carefully implemented in the patch set.
As mentioned in the cover letter, we followed the security model of user 
notifier and ptrace in a way that our implementation is as restrictive 
as them. Let me elaborate:

1. We require no less privilege than seccomp or eBPF requires individually
(e.g. for filter loading and the use of BPF helpers).

2. The new seccomp_extended LSM hook restricts the use of advanced bpf 
features (maps and helpers). Only when the hook permits the access can 
filters use standard helpers. The LSM hook is implemented in Yama and 
uses ptrace_scope to determine whether to allow access. This is based on 
the idea of reduction to ptrace, as the eBPF filters can instrument the 
process similar to ptrace.

3. The tracing helpers require additional capabilities (CAP_BPF and 
CAP_PERFMON).

4. For user-memory reading, we require CAP_SYS_PTRACE to read memory of
non-dumpable processes. If the capability is not fulfilled, the 
bpf_user_probe{,str} helper would return -EPERM. This is, again, 
reduction to ptrace.
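
As a rough illustration of items 2 to 4, the checks can be modelled in
Python as below. This is only a sketch of the rules as described in this
mail, not the kernel code: the function names and the lsm_hook_permits
flag are hypothetical, the Yama ptrace_scope policy is reduced to a single
boolean, and the user-read check assumes the LSM hook has already
permitted extended features. The ptrace capability is spelled
CAP_SYS_PTRACE, its name in the kernel headers.

```python
import errno


def tracing_helpers_allowed(lsm_hook_permits, caps):
    """Items 2 and 3: the hook must permit access, and the loader must
    hold both CAP_BPF and CAP_PERFMON."""
    return bool(lsm_hook_permits) and {"CAP_BPF", "CAP_PERFMON"} <= set(caps)


def user_memory_read_status(caps, target_dumpable):
    """Item 4: reading a non-dumpable process requires CAP_SYS_PTRACE.

    Returns 0 when the read may proceed, -EPERM otherwise.
    """
    if target_dumpable or "CAP_SYS_PTRACE" in caps:
        return 0
    return -errno.EPERM
```

For example, user_memory_read_status(set(), target_dumpable=False) yields
-EPERM, while the same call with {"CAP_SYS_PTRACE"} yields 0.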

We acknowledge the concerns about user namespace pointed out by Alexei 
Starovoitov. We are more than happy to roll out the solution in the V2 
patch.

Best,
Jinghao

Christian