Re: [PATCH RESEND v3 bpf-next 01/14] bpf: introduce BPF token object

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 7 Jul 2023 16:58:40 -0700

On Fri, Jul 7, 2023 at 3:00 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>
> > On Fri, Jul 7, 2023 at 6:04 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
> >>
> >> > On Thu, Jul 6, 2023 at 4:32 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> >>
> >> >> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
> >> >>
> >> >> > Having it as a separate single-purpose FS seems cleaner, because we
> >> >> > have use cases where we'd have one BPF FS instance created for a
> >> >> > container by our container manager, and then exposing a few separate
> >> >> > tokens with different sets of allowed functionality. E.g., one for
> >> >> > main intended workload, another for some BPF-based observability
> >> >> > tools, maybe yet another for more heavy-weight tools like bpftrace for
> >> >> > extra debugging. In the debugging case our container infrastructure
> >> >> > will be "evacuating" any other workloads on the same host to avoid
> >> >> > unnecessary consequences. The point is to not disturb
> >> >> > workload-under-human-debugging as much as possible, so we'd like to
> >> >> > keep userns intact, which is why mounting extra (more permissive) BPF
> >> >> > token inside already running containers is an important consideration.
> >> >>
> >> >> This example (as well as Yafang's in the sibling subthread) makes it
> >> >> even more apparent to me that it would be better with a model where the
> >> >> userspace policy daemon can just make decisions on each call directly,
> >> >> instead of mucking about with different tokens with different embedded
> >> >> permissions. Why not go that route (see my other reply for details on
> >> >> what I mean)?
> >> >
> >> > I don't know how you arrived at this conclusion,
> >>
> >> Because it makes it apparent that you're basically building a policy
> >> engine in the kernel with this...
> >
> > I disagree that this is a policy engine in the kernel. It's a building
> > block for delegation and enforcement. The policy itself is implemented
> > in user-space by a privileged process that decides when to issue BPF
> > tokens and with which configuration. Optionally, and if necessary,
> > it can be restricted further using BPF LSM in a more fine-grained
> > and dynamic way.
>
> Right, and I'm saying that it's too coarse-grained to be a proper

CAP_BPF, CAP_PERFMON, CAP_SYS_ADMIN, and CAP_NET_ADMIN are also very
coarse-grained, and somehow we make do with them outside of the user
namespace use case.

> building block in its own right. As evidenced by the need for adding an
> LSM on top to do anything fine-grained; a task which is decidedly

There is no *need* to add an LSM. For tons of practical use cases you
won't need one. Yes, people will have to decide whether they even need
to bother with more fine-grained controls. And if they do, LSM is there
to provide them.

> non-trivial to get right, BTW. Which means that the path of least
> resistance is going to be to just grant a token and not bother with the
> LSM, thus ending up with this being a giant foot gun from a security
> PoV.

If there is no need for an LSM, then yes, and I think that's totally
acceptable. It will be up to users to decide.

>
> >> > but we've debated BPF proxying and separate service at length, there
> >> > is no point in going on another round here.
> >>
> >> You had some objections to explicit proxying via RPC calls; I suggested
> >> a way of avoiding that by keeping the kernel in the loop, which you have
> >
> > I thought we settled the seccomp notify proposal?
>
> Your objection to that was that it was too much of a hack to read all
> the target process memory (etc) from the policy daemon, which I
> acknowledged and suggested a way of keeping the kernel in the loop so it
> can take responsibility for the gnarly bits while still allowing
> userspace to actually make the decision:
>

Your proposal for some new mechanism that blocks the bpf() syscall so
that another user space process can make the decision, and that somehow
provides all the data needed for that decision without that process
having to read the original process's memory (so presumably the kernel
would make a copy of the BPF program instructions, BTF contents, all
the strings, etc.?), sounded more like a joke and just a contrarian way
to provide *any* alternative, just to disagree with the much simpler
and more straightforward proposal.

I encourage you to spend some time prototyping this new mechanism,
sending an RFC, and gathering community feedback before using this
hand-wavy idea as an excuse to block a BPF token-like mechanism. I'll
be curious to read the discussion on how it differs from authoritative
LSM, seccomp notify, etc.

> https://lore.kernel.org/r/87v8ezb6x5.fsf@xxxxxxx
>
> (Last two paragraphs). Maybe that message just got lost somewhere on its
> way to your inbox?
>
> >> not responded to. If you're just going to go ahead with your solution
> >> over any objections you could just have stated so from the beginning and
> >> saved us all a lot of time :/
> >
> > It would also be good to understand that yours is but one of the
> > opinions. If you read the thread carefully you'll see that other
> > people have differing opinions. And yours doesn't necessarily have to
> > be the deciding one.
> >
> > I appreciate the feedback, but I don't appreciate the expectation that
> > your feedback is binding in any way.
>
> I'm not expecting veto rights, I'm objecting to being ignored. The way

You are not being ignored. We are just disagreeing. There is a
difference. BPF proxying was discussed at length, and people who manage
large sets of BPF applications voiced their concerns. The security
concerns you have about BPF token apply just as much to CAP_BPF and
other caps. BPF token actually makes it possible to drop those very
coarse-grained capabilities in a bunch of circumstances and improve
security overall. Also note that there were security folks in the
discussion who seem to be fine with the BPF token approach, overall.

You don't like my (and others') answers. That's fine, but please don't
pretend like you are being ignored.

> this development process is *supposed* to work (as far as I'm concerned)
> is that someone proposes a patch series, the community provides
> feedback, and discussion proceeds until there's at least rough consensus
> that the solution we've arrived at is the right way forward.

Rough consensus, not 100% consensus, though?.. There will always be
someone who disagrees.

>
> If you're going to cut that process short and just pick and choose which

Yep, clearly, going into the 3rd month of discussions (starting from
LSF/MM, and I don't even include the authoritative LSM discussions
before that) is cutting this process very short, of course.

> comments are worth addressing and which are not, I can't stop you,
> obviously; but at least do me the favour of being up front about it so I
> can stop wasting my time trying to be constructive.

I wouldn't say that a proposal like "some seccomp-notify-like
mechanism to let another process decide if the bpf() syscall should
proceed", with not much effort put into thinking about how specifically
it should be done and whether it's actually a better approach, was very
constructive. And it felt self-evident that it's not a good way,
especially after Christian himself said that the seccomp-based
approach is also not a good generic solution. Your proposal was just a
weird bpf()-specific (and not very well specified) twist on the
seccomp notify idea. But as I said above, give it a try; perhaps I'm
mistaken and the BPF community would love the idea and implementation.

>
> Anyhow, I guess this point is moot for this discussion since I'm about
> to leave for vacation for four weeks and won't be able to follow up on
> this. Apologies for the bad timing :/ I'll ping some RH folks and try to
> get them to keep an eye on this while I'm away...

Enjoy your vacation!

>
> >> Can we at least put this thing behind a kconfig option, so we can turn
> >> it off in distro kernels?
> >
> > Why can't distros disable this in some more dynamic way, though? With
> > the existing LSM mechanism, a sysctl, whatever? I think it would be useful
> > to let users have control over this and decide for themselves without
> > having to rebuild a custom kernel.
>
> A sysctl similar to the existing one for unprivileged BPF would be fine
> as well. If an LSM ends up being the only way to control it, though,
> that will carry so much operational overhead for us to get to a working
> state that it'll most likely be simpler to just patch it out of the
> kernel.

Sounds good, I will add a sysctl in the next version.

>
> -Toke
>