Hi Daniel, On Fri, Sep 29, 2023 at 11:48 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 9/26/23 1:01 AM, Jamal Hadi Salim wrote: > > On Fri, Sep 22, 2023 at 4:12 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >> On 9/20/23 1:20 AM, Jamal Hadi Salim wrote: > >>> On Tue, Sep 19, 2023 at 6:15 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >>>> On 9/19/23 4:59 PM, Victor Nogueira wrote: > [...] > >> > >> In the above case we don't have 'internal' errors which you want to trace, so I would > >> also love to avoid the cost of zeroing struct tcf_result res which should be 3x 8b for > >> every packet. > > > > We can move the zeroing inside tc_run() but we declare it in the same > > spot as we do right now. You will still need to set res.verdict as > > above. > > Would that work for you? > > What I'm not following is that with the below you can avoid the unnecessary > fast path cost (which is only for corner case which is almost never hit) and > get even better visibility. Are you saying it doesn't work? I am probably missing something: -1/UNSPEC is a legit errno. And the main motivation here for this patch is to disambiguate if it was -EPERM vs UNSPEC Maybe that is what you are calling a "corner case"? There are two options in my mind right now (since you are guaranteed in tcx_run you will never return anything below UNSPEC): 1) we just have the switch statement invocation inside an inline function and you can pass it sch_ret (for tcx case) and we'll pass it res.verdit for tc_run() case. 2) is something is we leave tcx_run alone and we have something along the lines of: -------------- diff --git a/net/core/dev.c b/net/core/dev.c index 1450f4741d9b..93613bce647c 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3985,7 +3985,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, struct net_device *orig_dev, bool *another) { struct bpf_mprog_entry *entry = rcu_dereference_bh(skb->dev->tcx_ingress); - struct tcf_result res = {0}; + struct tcf_result res; int sch_ret; if (!entry) @@ -4003,14 +4003,16 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, if (sch_ret != TC_ACT_UNSPEC) goto ingress_verdict; } + + res.verdict = 0; sch_ret = tc_run(tcx_entry(entry), skb, &res); if (sch_ret < 0) { kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS_ERROR); *ret = NET_RX_DROP; return NULL; } + sch_ret = res.verdict; ingress_verdict: - switch (res.verdict) { + switch (sch_ret) { case TC_ACT_REDIRECT: /* skb_mac_header check was done by BPF, so we can safely * push the L2 header back before redirecting to another ----------- on the drop reason - our thinking is to support drop_watch alongside tracepoint given kfree_skb_reason exists already; if i am not mistaken what you suggested would require us to create a new tracepoint? cheers, jamal > >> I was more thinking like something below could be a better choice. I presume your main > >> goal is to trace where these errors originated in the first place, so it might even be > >> useful to capture the actual return code as well. > > > > The main motivation is a few syzkaller bugs which resulted in not > > disambiguating between errors being returned and sometimes > > TC_ACT_SHOT. > > > >> Then you can use perf script, bpf and whatnot to gather further insights into what > >> happened while being less invasive and avoiding the need to extend struct tcf_result. > > > > We could use trace instead - the reason we have the skb reason is > > being used in the other spots (does this trace require ebpf to be > > usable?). > > No you can just use regular perf by attaching to the tracepoint, no need for using > bpf at all here. > > >> This would be quite similar to trace_xdp_exception() as well, and I think you can guarantee > >> that in fast path all errors are < TC_ACT_UNSPEC anyway. > > > > I am not sure i followed. 0 means success, result codes are returned in res now. > > What I was saying is that you don't need the struct change from the patch, but only > the changes where you rework TC_ACT_SHOT into one of the -E<errors>, and then with > the below you can pass this through an exception tracepoint. > > >> diff --git a/net/core/dev.c b/net/core/dev.c > >> index 85df22f05c38..4089d195144d 100644 > >> --- a/net/core/dev.c > >> +++ b/net/core/dev.c > >> @@ -3925,6 +3925,10 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb) > >> > >> mini_qdisc_bstats_cpu_update(miniq, skb); > >> ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false); > >> + if (unlikely(ret < TC_ACT_UNSPEC)) { > >> + trace_tc_exception(skb->dev, skb->tc_at_ingress, ret); > >> + ret = TC_ACT_SHOT; > >> + } > >> /* Only tcf related quirks below. */ > >> switch (ret) { > >> case TC_ACT_SHOT: > > Thanks, > Daniel