Re: [PATCH v2 bpf-next 00/18] BPF token

"Andy Lutomirski" <luto@xxxxxxxxxx> · Thu, 22 Jun 2023 18:02:57 -0700

On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@xxxxxxxxxx> wrote:
>
> For CAP_BPF too broad. It is broad, yes. If you have good ideas how to
> break it down some more -- please propose. But this is all orthogonal,
> because the blocking problem is fundamental incompatibility of user
> namespaces (and their implied isolation and sandboxing of workloads)
> and BPF functionality, which is global by its very nature. The latter
> is unavoidable in principle.

How, exactly, is BPF global by its very nature?

The *implementation* has some issues with globalness.  Much of it should be fixable.

>
> No matter how much you break down CAP_BPF, you can't enforce that BPF
> program won't interfere with applications in other containers. Or that
> it won't "spy" on them. It's just not what BPF can enforce in
> principle.

The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do.  There are bugs -- I get that.  There are helper functions that are fundamentally global.  But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.

>
> So that comes back down to a question of trust and then controlled
> delegation of BPF functionality. You trust workload with BPF usage
> because you reviewed the BPF code, workload, testing, etc? Grant BPF
> token and let that container use limited subset of BPF. Employ BPF LSM
> to further restrict it beyond what BPF token can control.
>
> You cannot trust an application to not do something harmful? You
> shouldn't grant it either CAP_BPF in init namespace, nor BPF token in
> user namespace. That's it. Pick your poison.

I think what's lost here is hardening vs restricting intended functionality.

We have access control to restrict intended functionality.  We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.

BPF needs hardening -- this is well established.  Right now, this is accomplished by restricting it to global root (effectively).  It should have access controls, too, but it doesn't.

>
> But all this cannot be mechanically decided or enforced. There has to
> be some humans involved in making these decisions. Kernel's job is to
> provide building blocks to grant and control BPF functionality to the
> extent that it is technically possible.
>

Exactly.  And it DOES NOT.  bpf maps, etc do not have sensible access controls.  Things that should not be global are global.  I'm saying the kernel should fix THAT.  Once it's in a state that it's at least credible to allow BPF in a user namespace, than come up with a way to allow it.

> As for "something to isolate the pinned maps/progs by different apps
> (why not DAC rules?)", there is no such thing, as I've explained
> already.
>
> I can install sched_switch raw_tracepoint BPF program (if I'm allowed
> to), and that program has system-wide observability. It cannot be
> bound to an application.

Great, a real example!

Either:

(a) don't run this in a container.  Have a service for the container to request the help of this program.

(b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).

> then what do we do when we switch from process A in container
> X to process B in container Y? Is that event belonging to container X?
> Or container Y?

I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality.  If you haven't defined what your program is even supposed to do in a container, don't run it there.

> Hopefully you can see where I'm going with this. And this is just one
> random tiny example. We can think up tons of other cases to prove BPF
> is not isolatable to any sort of "container".

No.  You have not come up with an example of why BPF is not isolatable to a container.  You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.

Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things.  They're both under the BPF umbrella.  They're not the same thing.

Passing a token into a container that allow that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter.  It makes no sense.