On Mon, Apr 11, 2022 at 6:04 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
>
> On Mon, Apr 11, 2022 at 12:07:20PM -0700, Stanislav Fomichev wrote:
> > On Fri, Apr 8, 2022 at 3:13 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > >
> > > On Thu, Apr 07, 2022 at 03:31:07PM -0700, Stanislav Fomichev wrote:
> > > > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> > > > index 064eccba641d..eca258ba71d8 100644
> > > > --- a/kernel/bpf/bpf_lsm.c
> > > > +++ b/kernel/bpf/bpf_lsm.c
> > > > @@ -35,6 +35,98 @@ BTF_SET_START(bpf_lsm_hooks)
> > > >  #undef LSM_HOOK
> > > >  BTF_SET_END(bpf_lsm_hooks)
> > > >
> > > > +static unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
> > > > +						const struct bpf_insn *insn)
> > > > +{
> > > > +	const struct bpf_prog *prog;
> > > > +	struct socket *sock;
> > > > +	struct cgroup *cgrp;
> > > > +	struct sock *sk;
> > > > +	int ret = 0;
> > > > +	u64 *regs;
> > > > +
> > > > +	regs = (u64 *)ctx;
> > > > +	sock = (void *)(unsigned long)regs[BPF_REG_0];
> > > > +	/*prog = container_of(insn, struct bpf_prog, insnsi);*/
> > > > +	prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > > nit. Rename prog to shim_prog.
> > >
> > > > +
> > > > +	if (unlikely(!sock))
> > > Is it possible in the lsm hooks? Can these hooks
> > > be rejected at load time instead?
> > Doesn't seem like it can be NULL, at least from the quick review that
> > I had; I'll take a deeper look.
> > I guess in general I wanted to be more defensive here because there
> > are 200+ hooks, new ones might arrive, and it's better to have the
> > check?
> not too worried about an extra runtime check for now.
> Instead, I have a concern that it will be a usage surprise when a
> successfully attached bpf program is then always silently ignored.
>
> Another question, for example, the inet_conn_request lsm_hook:
> LSM_HOOK(int, 0, inet_conn_request, const struct sock *sk, struct sk_buff *skb,
>	 struct request_sock *req)
>
> 'struct sock *sk' is the first argument, so it will use current's cgroup.
> inet_conn_request() likely runs in a softirq though, and then it will be
> incorrect. This runs-in-softirq case may not be limited to hooks that
> take an sk/sock argument either, not sure.

For now, I decided not to treat 'struct sock' cases as 'socket'
because of cases like sk_alloc_security where 'struct sock' is not
initialized. Looks like treating them as 'current' is also not 100%
foolproof. I guess we'd still have to have some special
cases/exceptions. Let me bring back that 'struct sock' handler and add
some btf-set to treat the hooks like inet_conn_request as the
exception for now.

> > > > +		return 0;
> > > > +
> > > > +	sk = sock->sk;
> > > > +	if (unlikely(!sk))
> > > Same here.
> > >
> > > > +		return 0;
> > > > +
> > > > +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > > > +	if (likely(cgrp))
> unrelated, but while talking about extra checks:
>
> I think the shim_prog has already acted as a higher-level (per attach-btf_id)
> knob, but do you think it may still be worth doing a bpf_empty_prog_array
> check here in case a cgroup does not have a prog to run?

Oh yeah, good idea, let me add those cgroup_bpf_sock_enabled checks.

> > > > +		ret = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[prog->aux->cgroup_atype],
> > > > +					    ctx, bpf_prog_run, 0);

> > [ ... ]
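To make that concrete: the existing cgroup_bpf_sock_enabled() macro already
does this kind of emptiness test, so the shim could bail out before
BPF_PROG_RUN_ARRAY_CG() when the per-atype effective array is empty.
Open-coded as a function here for readability (untested sketch; the name
lsm_cgroup_sock_enabled is made up, and it assumes an empty effective[]
slot still points at bpf_empty_prog_array.hdr, which is what
bpf_prog_array_alloc(0, ...) returns):

static inline bool lsm_cgroup_sock_enabled(struct sock *sk,
					   enum cgroup_bpf_attach_type atype)
{
	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
	struct bpf_prog_array *array;

	/* An empty effective[] slot points at bpf_empty_prog_array.hdr,
	 * so a pointer comparison is enough to skip the run-array call.
	 */
	array = rcu_access_pointer(cgrp->bpf.effective[atype]);
	return array != &bpf_empty_prog_array.hdr;
}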
>
> > > > @@ -100,6 +123,15 @@ static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
> > > >  	link->cgroup = NULL;
> > > >  }
> > > >
> > > > +static void bpf_cgroup_lsm_shim_release(struct bpf_prog *prog,
> > > > +					enum cgroup_bpf_attach_type atype)
> > > > +{
> > > > +	if (!prog || atype != prog->aux->cgroup_atype)
> > > prog cannot be NULL here, no?
> > >
> > > The 'atype != prog->aux->cgroup_atype' also looks suspicious, considering
> > > prog->aux->cgroup_atype is only initialized (and meaningful) for BPF_LSM_CGROUP.
> > > I suspect incorrectly passing this test will crash in the below
> > > bpf_trampoline_unlink_cgroup_shim(). More on this later.
> > >
> > > > +		return;
> > > > +
> > > > +	bpf_trampoline_unlink_cgroup_shim(prog);
> > > > +}
> > > > +
> > > >  /**
> > > >   * cgroup_bpf_release() - put references of all bpf programs and
> > > >   *                        release all cgroup bpf data
> > > > @@ -123,10 +155,16 @@ static void cgroup_bpf_release(struct work_struct *work)
> > > Copying some missing loop context here:
> > >
> > > 	for (atype = 0; atype < ARRAY_SIZE(cgrp->bpf.progs); atype++) {
> > > 		struct list_head *progs = &cgrp->bpf.progs[atype];
> > > 		struct bpf_prog_list *pl, *pltmp;
> > >
> > > >  		list_for_each_entry_safe(pl, pltmp, progs, node) {
> > > >  			list_del(&pl->node);
> > > > -			if (pl->prog)
> > > > +			if (pl->prog) {
> > > > +				bpf_cgroup_lsm_shim_release(pl->prog,
> > > > +							    atype);
> > > atype could be 0 (CGROUP_INET_INGRESS) here. bpf_cgroup_lsm_shim_release()
> > > above will go ahead with bpf_trampoline_unlink_cgroup_shim().
> > > It will break some of the assumptions, e.g. prog->aux->attach_btf is NULL
> > > for CGROUP_INET_INGRESS.
> > >
> > > Instead, only call bpf_cgroup_lsm_shim_release() for BPF_LSM_CGROUP?
> > >
> > > If the above observation is sane, I wonder if the existing test_progs
> > > have uncovered it, or maybe the existing tests just always detach
> > > cleanly themselves before cleaning up the cgroup, which then avoided
> > > this case.
> > Might be what's happening here:
> > https://github.com/kernel-patches/bpf/runs/5876983908?check_suite_focus=true
> hmm.... this one looks different. I am thinking the oops should happen
> in bpf_obj_id() which is not inlined. It didn't ring any bell after a
> quick look, so yeah, let's fix the known issue first.
> >
> > Although, I'm not sure why it's z15 only. Good point on filtering by
> > BPF_LSM_CGROUP, will do.
> >
> > > >  				bpf_prog_put(pl->prog);
> > > > -			if (pl->link)
> > > > +			}
> > > > +			if (pl->link) {
> > > > +				bpf_cgroup_lsm_shim_release(pl->link->link.prog,
> > > > +							    atype);
> > > >  				bpf_cgroup_link_auto_detach(pl->link);
> > > > +			}
> > > >  			kfree(pl);
> > > >  			static_branch_dec(&cgroup_bpf_enabled_key[atype]);
> > > >  		}

> > [ ... ]
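Re: filtering by BPF_LSM_CGROUP, roughly what I plan to try (sketch only;
it assumes BPF_LSM_CGROUP ends up in prog->expected_attach_type, which is
how the rest of the series would tell these programs apart):

static void bpf_cgroup_lsm_shim_release(struct bpf_prog *prog)
{
	/* Shims only exist for BPF_LSM_CGROUP programs. For any other
	 * atype (e.g. CGROUP_INET_INGRESS == 0), prog->aux->cgroup_atype
	 * and prog->aux->attach_btf are not meaningful, so bail out.
	 */
	if (prog->type != BPF_PROG_TYPE_LSM ||
	    prog->expected_attach_type != BPF_LSM_CGROUP)
		return;

	bpf_trampoline_unlink_cgroup_shim(prog);
}

Checking the program itself also makes the atype argument, and with it the
suspicious 'atype != prog->aux->cgroup_atype' test, unnecessary.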
>
> > > > +int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> > > > +				    struct bpf_attach_target_info *tgt_info)
> > > > +{
> > > > +	struct bpf_prog *shim_prog = NULL;
> > > > +	struct bpf_trampoline *tr;
> > > > +	bpf_func_t bpf_func;
> > > > +	u64 key;
> > > > +	int err;
> > > > +
> > > > +	key = bpf_trampoline_compute_key(NULL, prog->aux->attach_btf,
> > > > +					 prog->aux->attach_btf_id);
> > > > +
> > > > +	err = bpf_lsm_find_cgroup_shim(prog, &bpf_func);
> > > > +	if (err)
> > > > +		return err;
> > > > +
> > > > +	tr = bpf_trampoline_get(key, tgt_info);
> > > > +	if (!tr)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	mutex_lock(&tr->mutex);
> > > > +
> > > > +	shim_prog = cgroup_shim_find(tr, bpf_func);
> > > > +	if (shim_prog) {
> > > > +		/* Reusing existing shim attached by the other program.
> > > > +		 */
> > > The shim_prog is reused by >1 BPF_LSM_CGROUP progs and
> > > shim_prog is hidden from userspace also (no id), so it may be worth
> > > bringing this up:
> > >
> > > In __bpf_prog_enter(), other than the bpf stats of the shim_prog
> > > becoming useless (a very minor thing), it also checks
> > > shim_prog->active and bumps the misses counter. Now, the misses counter
> > > is no longer visible to users. Since it is actually running the cgroup prog,
> > > maybe there is no need for the active check?
> >
> > Agree that the active counter will probably be taken care of when the
> > actual program runs;
> iirc, the BPF_PROG_RUN_ARRAY_CG does not need the active counter.
>
> > but not sure it's worth the effort in trying to
> > remove it here?
> I was thinking that if the active counter got triggered and the
> BPF_LSM_CGROUP call got missed, then there is no way to tell this case
> got hit without exposing the stats of the shim_prog, and it could be a
> pretty hard problem to chase. It probably won't be an issue for
> non-sleepable now if the rcu_read_lock() maps to preempt_disable().
> Not sure about the future sleepable case.
>
> I am thinking to avoid doing all the active count and stats count
> in __bpf_prog_enter() and __bpf_prog_exit() for BPF_LSM_CGROUP. afaik,
> only the rcu_read_lock and rcu_read_unlock are useful to protect
> the shim_prog itself. Maybe a __bpf_nostats_enter() and
> __bpf_nostats_exit().

SG, let me try to skip that for the BPF_LSM_CGROUP case.

> > Regarding "no longer visible to users": that's a good point. Should I
> > actually add those shim progs to the prog_idr? Or just hide them as an
> > "internal implementation detail"?
> Then no need to expose the shim_progs to the idr.
>
> ~~~~
> [ btw, while thinking about the shim_prog, I also think there is no need
> for one shim_prog for each attach_btf_id, which is essentially
> prog->aux->cgroup_atype. The static prog->aux->cgroup_atype can be
> passed on the stack when preparing the trampoline.
> Just an idea, not suggesting it must be done now. This can be
> optimized later since it does not affect the API. ]

Ack, I guess in theory there need to be only two "global" shim_progs,
one for 'struct socket' and another for 'current' (or more, for other
types). I went with allocating an instance per trampoline to avoid
having that global state. Working under tr->mutex simplifies things a
bit imo, but, as you said, we can optimize here if needed.
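Btw, for skipping the active/stats accounting, here is a minimal sketch of
the __bpf_nostats_enter()/__bpf_nostats_exit() pair (names from your
suggestion; the bodies just mirror the current non-sleepable
__bpf_prog_enter()/__bpf_prog_exit() minus the prog->active check and the
stats/misses updates, with NO_START_TIME telling the exit path that there
are no stats to fold in):

static u64 notrace __bpf_nostats_enter(struct bpf_prog *prog)
{
	/* The shim only dispatches to the cgroup effective array;
	 * rcu_read_lock() is still needed to keep the shim_prog
	 * itself alive while it runs.
	 */
	rcu_read_lock();
	migrate_disable();
	return NO_START_TIME;
}

static void notrace __bpf_nostats_exit(struct bpf_prog *prog, u64 start)
{
	migrate_enable();
	rcu_read_unlock();
}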