Re: [RFC PATCH bpf-next 00/13] bpf: Introduce BPF namespace

Yafang Shao <laoar.shao@xxxxxxxxx> · Tue, 4 Apr 2023 10:59:55 +0800

On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > >
> > > > > > In container environment, some users want to run bpf programs in their
> > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > Agreed that it is important to enable tools like bpftool without
> > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > >
> > > >
> > > > It seems we can't.
> > >
> > > Yafang,
> > >
> > > It's a Nack.
> > >
> > > The only thing you've been trying to "solve" with bpf namespace is to
> > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > The concept of bpf namespace is not even close to be thought through.
> >
> > Right, it is more likely a PoC in its current state.
> >
> > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > Please do not send patches like this in the future.
> >
> > The reason I sent it with an early state is that I want to get some
> > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > then I can improve it based on these feedbacks and present it more
> > specifically at the workshop. Then the discussion will be more
> > effective.
> >
> > > You need to start with describing the problem you want to solve,
> > > then propose _several_ solutions, describe their pros and cons,
> > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > and when the community agrees that 1. problem is worth solving,
> > > 2. the solution makes sense, only then work on patches.
> > >
> >
> > I would like to give a short discussion on the BPF namespace if
> > everything goes fine.
>
> Not in this shape of BPF namespace as done in this patch set.
> We've talked about BPF namespace in the past. This is not it.
>
> > > "In container environment, some users want to run bpf programs in their containers."
> > > is something Song brought up at LSFMMBPF a year ago.
> > > At that meeting most of the folks agreed that there is a need to run bpf
> > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > close in solving this task.
> >
> > Currently in our production environment, all the containers running
> > bpf programs are privileged, that is risky. So actually the goal of
> > the BPF namespace is to make them (or part of them) non-privileged.
> > But some of the abilities of these bpf programs will be lost in this
> > procedure, like the debug-bility with bpftool, so we need to fix it.
> > Agree with you that this goal is far from making bpf programs safely
> > running in a container environment.
>
> I disagree that allowing admin to run bpftool without sudo is a task
> worth solving. The visibility of bpf progs in a container is a different task.
> Without doing any kernel changes we can add a flag to bpftool to let
> 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> bpftool already does prog->pid mapping with bpf iterators.
> It can filter by cgroup just as well.

IIUC, at least we need bellow change in the kernel,

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
        if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
                return -EINVAL;

-       if (!capable(CAP_SYS_ADMIN))
-               return -EPERM;
-
        next_id++;
        spin_lock_bh(lock);
        if (!idr_get_next(idr, &next_id))

Because the container doesn't have CAP_SYS_ADMIN enabled, while they
only have CAP_BPF and other required CAPs.

Another possible solution is that we run an agent in the host, and the
user in the container who wants to get the bpf objects info in his
container should send a request to this agent via unix domain socket.
That is what we are doing now in our production environment.  That
said, each container has to run a client to get the bpf object fd.
There are some downsides,
-  It can't handle pinned bpf programs
   For pinned programs, the user can get them from the pinned files
directly, so he can use bpftool in his case, only with some
complaints.
-  If the user attached the bpf prog, and then removed the pinned
file, but didn't detach it.
   That happened. But this error case can't be handled.
- There may be other corner cases that it can't fit.

There's a solution to improve it, but we also need to change the
kernel. That is, we can use the wasted space btf->name.

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b7e5a55..59d73a3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
u32 btf_data_size,
                err = -ENOMEM;
                goto errout;
        }
+       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
+                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
        env->btf = btf;

        data = kvmalloc(btf_data_size, GFP_KERNEL | __GFP_NOWARN);


-- 
Regards
Yafang