Re: [PATCH bpf-next v2 0/3] bpf: expose bpf_{g,s}et_retval to more cgroup hooks

Martin KaFai Lau <kafai@xxxxxx> · Wed, 17 Aug 2022 17:21:44 -0700



On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote:
> On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> >
> > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > >
> > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote:
> > > > > Apparently, only a small subset of cgroup hooks actually falls
> > > > > back to cgroup_base_func_proto. This leads to unexpected result
> > > > > where not all cgroup helpers have bpf_{g,s}et_retval.
> > > > >
> > > > > It's getting harder and harder to manage which helpers are exported
> > > > > to which hooks. We now have the following call chains:
> > > > >
> > > > > - cg_skb_func_proto
> > > > >   - sk_filter_func_proto
> > > > >     - bpf_sk_base_func_proto
> > > > >       - bpf_base_func_proto
> > > > Could you explain how bpf_set_retval() will work with cgroup prog that
> > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress).
> > > > It will be a useful doc to add to the uapi bpf.h for
> > > > the bpf_set_retval() helper.
> > >
> > > I think it's the same case as the case without bpf_set_retval? I don't
> > > think the flags can be exported via bpf_set_retval, it just lets the
> > > users override EPERM.
> > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN.
> > What if the prog now returns 3 and also bpf_set_retval(-Exxxx).
> > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg()
> > correctly,  __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP
> > instead of the -Exxxx.  The -Exxxx is probably what the bpf prog
> > is expecting after calling bpf_set_retval(-Exxxx) ?
> > Thinking more about it, should __cgroup_bpf_run_filter_skb() always
> > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ?
> 
> I think we used to have "return 0/1/2/3" to indicate different
> conditions but then switched to "return 1/0" + flags.
For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'?
I think it is more like return "0 (OK)/-Exxxx" + ret_flags now.

> So, technically, "return 3 + bpf_set_retval" is still fundamentally a
> "return 3" api-wise.
hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does
"bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean?

> I guess we can make bpf_set_retval override that but let me start by
> trying to document what we currently have.
To be clear, for cg_skb case, I meant to clear the ret_flags only if
run_ctx.retval is set.

> If it turns out to be super ugly, we can try to fix it. (not sure how
> much of a uapi that is)
sgtm.

> 
> 
> 
> > > Let me verify and I can add a note to bpf_set_retval uapi definition
> > > to mention that it just overrides EPERM. bpf_set_retval should
> > > probably not talk about userspace/syscall and instead use the words
> > > like "caller".
> > yeah, it is no longer syscall return value only now.