Re: [RFC PATCH bpf-next] bpf: Allow get bpf object with CAP_BPF

Yafang Shao <laoar.shao@xxxxxxxxx> · Thu, 1 Dec 2022 22:46:44 +0800

On Thu, Dec 1, 2022 at 8:38 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> On Wed, Nov 30, 2022 at 10:07 AM Song Liu <song@xxxxxxxxxx> wrote:
> >
> > On Wed, Nov 30, 2022 at 3:59 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > >
> > [...]
> > > I understand that allowing the ID->FD transition only for
> > > CAP_SYS_ADMIN is for security.
> > > But it also prevents users from converting the IDs of their own bpf
> > > objects into FDs, which is a problem.
> > >
> > > > From the commit message, I'm not clear how BPF is debugged in
> > > > containers in your use case. Maybe the debugging process should be
> > > > required to have CAP_SYS_ADMIN?
> > > >
> > >
> > > Some container users run bpf programs in their containers, and
> > > sometimes they want to inspect the bpf objects they created
> > > using bpftool, or read/write the bpf maps with their own tools. But if
> > > the bpf objects are not pinned, the only way to get these bpf objects
> > > is via SCM_RIGHTS.
> > > There should be a general way to get the FDs of their own objects when
> > > CAP_BPF is enabled.
> > > With CAP_SYS_ADMIN, the container user can do almost anything, which
> > > is very dangerous.
> > > With CAP_BPF, the risk can be kept within BPF.
> > >
> > > I think we should improve this situation by allowing users to
> > > convert the IDs of their own bpf objects into FDs.
> > > There are some possible solutions:
> > > 1. introduce BPF_ID namespace
> > >     Let's use a namespace to isolate the bpf object IDs instead of
> > >     preventing users from reading all IDs.
> > > 2. introduce a global sysctl knob to allow users to do the ID->FD transition
> > >     For example, introduce a new value into unprivileged_bpf_disabled:
> > >     -0 Unprivileged calls to ``bpf()`` are enabled
> > >     +0 Unprivileged calls to ``bpf()`` are enabled except the calls
> > >     +  which explicitly require ``CAP_BPF`` or ``CAP_SYS_ADMIN``
> > >      1 Unprivileged calls to ``bpf()`` are disabled without recovery
> > >      2 Unprivileged calls to ``bpf()`` are disabled
> > >     +3 All unprivileged calls to ``bpf()`` are enabled
> > >
> > > WDYT ?
> >
> > Personally, I think some namespace might be the solution we need.
> > But adding a namespace is a lot of work, so we need to make sure to
> > do it correctly.
> >
> > This might be a good topic to discuss in the BPF office hour.
> >
>
> I think a namespace is preferable. A discussion in the BPF office
> hour sounds good.
>
> Following are my thoughts:
>

Thanks for your thoughts.

> 1. What does the BPF_ID namespace look like? Will it be like the PID
> namespace, remapping IDs in each namespace? or just restricting the
> object IDs visible to the users?
>

I prefer the former. It would work like the PID namespace, which also
allocates its IDs with idr_alloc().
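
For reference, here is a minimal sketch of the ID->FD transition as it
exists today, using the raw bpf() syscall with BPF_MAP_GET_FD_BY_ID. The
helper name map_fd_by_id() is made up for illustration. With ID
remapping, the ID passed in would presumably be resolved in the caller's
namespace rather than globally, but that is only an assumption about a
design that does not exist yet.

```c
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel for an FD that refers to the map
 * with the given ID. Returns the FD, or -1 with errno set. On current
 * kernels this typically fails with EPERM unless the caller has
 * CAP_SYS_ADMIN, and with ENOENT if no map has that ID.
 */
static int map_fd_by_id(unsigned int id)
{
	union bpf_attr attr;

	/* Unused fields of bpf_attr must be zero. */
	memset(&attr, 0, sizeof(attr));
	attr.map_id = id;

	return (int)syscall(__NR_bpf, BPF_MAP_GET_FD_BY_ID, &attr, sizeof(attr));
}
```

A caller would check the return value and errno, e.g. treat EPERM as
"missing capability" and ENOENT as "no such ID".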

> 2. What's wrong with passing FD? Is it really necessary to introduce a
> namespace for this purpose?
>

Passing FDs is not flexible, and generic tools like bpftool can't work
that way. In the long run, I think the CAP_SYS_ADMIN restriction should
be replaced by better isolation mechanisms, so introducing a namespace
for that purpose is not a bad idea.
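
To illustrate why FD passing is awkward, here is a rough sketch of the
SCM_RIGHTS path over an AF_UNIX socket. The helper names send_fd() and
recv_fd() are made up for this example, and any FD works as the payload
(a bpf map FD would be sent the same way). The point is that both sides
must cooperate: the owner of the object has to run the sending half, so
a standalone tool cannot simply look the object up by itself.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one FD over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* one byte of real data carries the cmsg */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {			/* union keeps the buffer aligned for cmsghdr */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one FD sent by send_fd(). Returns the new FD or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	/* The kernel installed a new FD in this process; copy it out. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

In a real setup the two helpers run in different processes connected by
a socket that both sides agreed on in advance, which is exactly the
coordination a generic tool does not have.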

> 3. IIRC, Song proposed introducing a namespace for BPF isolation, not
> just isolating IDs [1]. How does it relate to the BPF_ID namespace?
>
> [1] https://lore.kernel.org/all/CAPhsuW6c17p3XkzSxxo7YBW9LHjqerOqQvt7C1+S--8C9omeng@xxxxxxxxxxxxxx/

I have looked through the slides for this proposal, but could not
figure out how Song plans to design the BPF namespace. Maybe Song can
give us a better explanation.
As I understand it, the goal of Song's proposal would have to be
achieved by combining several namespaces and other isolation
mechanisms. For example, with the help of the PID namespace, we can
make sure that bpf programs running in a container can only trace the
tasks in that container.

-- 
Regards
Yafang