Re: [RFC PATCH bpf-next 00/13] bpf: Introduce BPF namespace

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 5 Apr 2023 19:06:56 -0700

On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote:
> On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > > >
> > > > > > > In container environment, some users want to run bpf programs in their
> > > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > > >
> > > > > > Agreed that it is important to enable tools like bpftool without
> > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > > >
> > > > >
> > > > > It seems we can't.
> > > >
> > > > Yafang,
> > > >
> > > > It's a Nack.
> > > >
> > > > The only thing you've been trying to "solve" with bpf namespace is to
> > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > > The concept of bpf namespace is not even close to be thought through.
> > >
> > > Right, it is more likely a PoC in its current state.
> > >
> > > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > > Please do not send patches like this in the future.
> > >
> > > The reason I sent it with an early state is that I want to get some
> > > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > > then I can improve it based on these feedbacks and present it more
> > > specifically at the workshop. Then the discussion will be more
> > > effective.
> > >
> > > > You need to start with describing the problem you want to solve,
> > > > then propose _several_ solutions, describe their pros and cons,
> > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > > and when the community agrees that 1. problem is worth solving,
> > > > 2. the solution makes sense, only then work on patches.
> > > >
> > >
> > > I would like to give a short discussion on the BPF namespace if
> > > everything goes fine.
> >
> > Not in this shape of BPF namespace as done in this patch set.
> > We've talked about BPF namespace in the past. This is not it.
> >
> > > > "In container environment, some users want to run bpf programs in their containers."
> > > > is something Song brought up at LSFMMBPF a year ago.
> > > > At that meeting most of the folks agreed that there is a need to run bpf
> > > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > > close in solving this task.
> > >
> > > Currently in our production environment, all the containers running
> > > bpf programs are privileged, that is risky. So actually the goal of
> > > the BPF namespace is to make them (or part of them) non-privileged.
> > > But some of the abilities of these bpf programs will be lost in this
> > > procedure, like the debug-bility with bpftool, so we need to fix it.
> > > Agree with you that this goal is far from making bpf programs safely
> > > running in a container environment.
> >
> > I disagree that allowing admin to run bpftool without sudo is a task
> > worth solving. The visibility of bpf progs in a container is a different task.
> > Without doing any kernel changes we can add a flag to bpftool to let
> > 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> > bpftool already does prog->pid mapping with bpf iterators.
> > It can filter by cgroup just as well.
> 
> IIUC, at least we need bellow change in the kernel,

No. The user should just 'sudo bpftool ...' instead.

> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
>         if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
>                 return -EINVAL;
> 
> -       if (!capable(CAP_SYS_ADMIN))
> -               return -EPERM;
> -
>         next_id++;
>         spin_lock_bh(lock);
>         if (!idr_get_next(idr, &next_id))
> 
> Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> only have CAP_BPF and other required CAPs.
> 
> Another possible solution is that we run an agent in the host, and the
> user in the container who wants to get the bpf objects info in his
> container should send a request to this agent via unix domain socket.
> That is what we are doing now in our production environment.  That
> said, each container has to run a client to get the bpf object fd.

None of such hacks are necessary. People that debug bpf setups with bpftool
can always sudo.

> There are some downsides,
> -  It can't handle pinned bpf programs
>    For pinned programs, the user can get them from the pinned files
> directly, so he can use bpftool in his case, only with some
> complaints.
> -  If the user attached the bpf prog, and then removed the pinned
> file, but didn't detach it.
>    That happened. But this error case can't be handled.
> - There may be other corner cases that it can't fit.
> 
> There's a solution to improve it, but we also need to change the
> kernel. That is, we can use the wasted space btf->name.
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index b7e5a55..59d73a3 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> u32 btf_data_size,
>                 err = -ENOMEM;
>                 goto errout;
>         }
> +       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> +                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));

Unnecessary.
comm, pid, cgroup can be printed by bpftool without changing the kernel.