On Wed, Aug 17, 2022 at 08:42:54PM -0700, Stanislav Fomichev wrote: > On Wed, Aug 17, 2022 at 5:21 PM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > On Wed, Aug 17, 2022 at 04:59:06PM -0700, Stanislav Fomichev wrote: > > > On Wed, Aug 17, 2022 at 4:27 PM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > > > > > On Wed, Aug 17, 2022 at 03:41:26PM -0700, Stanislav Fomichev wrote: > > > > > On Wed, Aug 17, 2022 at 12:07 PM Martin KaFai Lau <kafai@xxxxxx> wrote: > > > > > > > > > > > > On Tue, Aug 16, 2022 at 01:12:11PM -0700, Stanislav Fomichev wrote: > > > > > > > Apparently, only a small subset of cgroup hooks actually falls > > > > > > > back to cgroup_base_func_proto. This leads to unexpected result > > > > > > > where not all cgroup helpers have bpf_{g,s}et_retval. > > > > > > > > > > > > > > It's getting harder and harder to manage which helpers are exported > > > > > > > to which hooks. We now have the following call chains: > > > > > > > > > > > > > > - cg_skb_func_proto > > > > > > > - sk_filter_func_proto > > > > > > > - bpf_sk_base_func_proto > > > > > > > - bpf_base_func_proto > > > > > > Could you explain how bpf_set_retval() will work with cgroup prog that > > > > > > is not syscall and can return flags in the higher bit (e.g. cg_skb egress). > > > > > > It will be a useful doc to add to the uapi bpf.h for > > > > > > the bpf_set_retval() helper. > > > > > > > > > > I think it's the same case as the case without bpf_set_retval? I don't > > > > > think the flags can be exported via bpf_set_retval, it just lets the > > > > > users override EPERM. > > > > eg. Before, a cg_skb@egress prog returns 3 to mean NET_XMIT_CN. > > > > What if the prog now returns 3 and also bpf_set_retval(-Exxxx). > > > > If I read how __cgroup_bpf_run_filter_skb() uses bpf_prog_run_array_cg() > > > > correctly, __cgroup_bpf_run_filter_skb() will return NET_XMIT_DROP > > > > instead of the -Exxxx. The -Exxxx is probably what the bpf prog > > > > is expecting after calling bpf_set_retval(-Exxxx) ? > > > > Thinking more about it, should __cgroup_bpf_run_filter_skb() always > > > > return -Exxxx whenever a -ve retval is set in bpf_set_retval() ? > > > > > > I think we used to have "return 0/1/2/3" to indicate different > > > conditions but then switched to "return 1/0" + flags. > > For 'int bpf_prog_run_array_cg(..., u32 *ret_flags)'? > > I think it is more like return "0 (OK)/-Exxxx" + ret_flags now. > > Yes, right now that's that case. What I meant to say is that for the > BPF program itself, the api is still "return a set of predefined > values". We don't advertise the flags to the bpf programs. 'return 2' > is a perfectly valid return for cgroup/egress that will tx the packet > with a cn. (where bpf_prog_run_array_cg sees it as a 'return 0 + (1 << > 1)') > > > > So, technically, "return 3 + bpf_set_retval" is still fundamentally a > > > "return 3" api-wise. > > hm....for the exisiting usecase (eg. CGROUP_SETSOCKOPT), what does > > "bpf-prog-return 1 + bpf_set_retval(-EPERM)" mean? > > I think bpf_set_retval takes precedence and in this case bpf_prog_run > wrapper will return -EPERM to the caller. > Will try to document that as well. > > > > I guess we can make bpf_set_retval override that but let me start by > > > trying to document what we currently have. > > To be clear, for cg_skb case, I meant to clear the ret_flags only if > > run_ctx.retval is set. > > Are you suggesting something like the following? > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index fd113bd2f79c..c110cbe52001 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -61,6 +61,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > bpf_reset_run_ctx(old_run_ctx); > rcu_read_unlock(); > migrate_enable(); > + if (IS_ERR_VALUE((long)run_ctx.retval)) > + *ret_flags = 0; > return run_ctx.retval; > } > > I think this will break the 'return 2' case? But is it worth it doing > it more carefully like this? LMKWYT. The below should work. Not sure it is worth it but before doing this... During this discussion, I think I am not clear what is the use case on bpf_{g,s}et_retval() for cg_skb. Could you describe how it will be used in your use case? Is it for another tracing program to get a different return value from (eg.) sk_filter_trim_cap or ip[6]_output? Not meaning the helper should not be exposed. It is easier to think with some examples. > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index fd113bd2f79c..dcd25561f8d4 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -39,6 +39,7 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > const struct bpf_prog_array *array; > struct bpf_run_ctx *old_run_ctx; > struct bpf_cg_run_ctx run_ctx; > + bool seen_return0 = false; > u32 func_ret; > > run_ctx.retval = retval; > @@ -54,13 +55,17 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp, > *(ret_flags) |= (func_ret >> 1); > func_ret &= 1; > } > - if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) > + if (!func_ret && !IS_ERR_VALUE((long)run_ctx.retval)) { > run_ctx.retval = -EPERM; > + seen_return0 = true; > + } > item++; > } > bpf_reset_run_ctx(old_run_ctx); > rcu_read_unlock(); > migrate_enable(); > + if (IS_ERR_VALUE((long)run_ctx.retval) && !seen_return0) > + *ret_flags = 0; > return run_ctx.retval; > } > > > > If it turns out to be super ugly, we can try to fix it. (not sure how > > > much of a uapi that is) > > sgtm. > > > > > > > > > > > > > > > > Let me verify and I can add a note to bpf_set_retval uapi definition > > > > > to mention that it just overrides EPERM. bpf_set_retval should > > > > > probably not talk about userspace/syscall and instead use the words > > > > > like "caller". > > > > yeah, it is no longer syscall return value only now.