Re: [PATCH bpf-next v2 0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks

Martin KaFai Lau <kafai@xxxxxx> · Thu, 18 Aug 2022 12:54:39 -0700

On Thu, Aug 18, 2022 at 11:30:21AM -0700, Stanislav Fomichev wrote:
> On Thu, Aug 18, 2022 at 11:16 AM Martin KaFai Lau <kafai@xxxxxx> wrote:
> >
> > On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > >
> > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> > > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > > > > > >
> > > > > > > > > It's getting harder and harder to manage which helpers are exported
> > > > > > > > > to which hooks. We now have the following call chains:
> > > > > > > > >
> > > > > > > > > - cg_skb_func_proto
> > > > > > > > >   - sk_filter_func_proto
> > > > > > > > >     - bpf_sk_base_func_proto
> > > > > > > > >       - bpf_base_func_proto
> > > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > > > > > It will be a useful doc to add to the uapi bpf.h for
> > > > > > > > the bpf_set_retval() helper.
> > > > > > >
> > > > > > > I think it's the same case as the case without bpf_set_retval? I don't
> > > > > > > think the flags can be exported via bpf_set_retval, it just lets the
> > > > > > > users override EPERM.
> > > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > > > > > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > > > > > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > > > > > is expecting after calling bpf_set_retval(-Exxxx) ?
> > > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> > > > >
> > > > > I think we used to have "return 0/1/2/3" to indicate different
> > > > > conditions but then switched to "return 1/0" + flags.
> > > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
> > > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.
> > >
> > > Yes, right now that's that case. What I meant to say is that for the
> > > BPF program itself, the api is still "return a set of predefined
> > > values". We don't advertise the flags to the bpf programs. 'return 2'
> > > is a perfectly valid return for cgroup/egress that will tx the packet
> > > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 <<
> > > 1)')
> > >
> > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> > > > > "return 3" api-wise.
> > > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
> > > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?
> > >
> > > I think bpf_set_retval takes precedence and in this case bpf_prog_run
> > > wrapper will return -EPERM to the caller.
> > > Will try to document that as well.
> > >
> > > > > I guess we can make bpf_set_retval override that but let me start by
> > > > > trying to document what we currently have.
> > > > To be clear, for cg_skb case, I meant to clear the ret_flags only if
> > > > run_ctx.retval is set.
> > >
> > > Are you suggesting something like the following?
> > >
> > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > > index fd113bd2f79c..c110cbe52001 100644
> > > --- a/kernel/bpf/cgroup.c
> > > +++ b/kernel/bpf/cgroup.c
> > > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> > >         bpf_reset_run_ctx(old_run_ctx);
> > >         rcu_read_unlock();
> > >         migrate_enable();
> > > +       if (IS_ERR_VALUE((long)run_ctx.retval))
> > > +               *ret_flags = 0;
> > >         return run_ctx.retval;
> > >  }
> > >
> > > I think this will break the 'return 2' case? But is it worth it doing
> > > it more carefully like this? LMKWYT.
> > The below should work. Not sure it is worth it
> > but before doing this...
> >
> > During this discussion, I think I am not clear what is the use case
> > on bpf_{g,s}et_retval() for cg_skb.  Could you describe how it will be
> > used in your use case?  Is it for another tracing program to get
> > a different return value from (eg.) sk_filter_trim_cap or ip[6]_output?
> >
> > Not meaning the helper should not be exposed.  It is easier
> > to think with some examples.
> 
> I don't really need them in cg_skb, I want them in cg_sock so I can
> return a custom errno from socket() syscall.
bpf_{g,s}et_retval() will probably be useful for sock_addr too.
There is a BPF_RET_BIND_NO_CAP_NET_BIND_SERVICE falling
into a similar bucket.  However, I think it should be fine
since the higher bit is only used when the bpf prog
returns OK.

> You're probably right and it doesn't make sense to support them in
> cg_skb. Most of the
> BPF_CGROUP_RUN_PROG_INET_INGRESS/BPF_CGROUP_RUN_PROG_INET_EGRESS
> callers don't seem to care about returned error code? (from my brief
> grepping)
> Let's maybe err on the safe side and special case cg_skb for now (in
> cgroup_common_func_proto) and not expose retval helpers?
Ah. I was under the wrong impression that you have a use case
on cg_skb because cg_skb_func_proto was in the stack
example above :)

Yep, lets disable the bpf_{g,s}et_retval() helper for CGROUP_SKB
and also SOCK_OP(S) for now.