On Tue, Oct 3, 2023 at 9:49 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > > On 10/3/23 2:46 PM, Jamal Hadi Salim wrote: > > On Tue, Oct 3, 2023 at 5:00 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >> On 10/2/23 9:54 PM, Jamal Hadi Salim wrote: > >>> On Fri, Sep 29, 2023 at 11:48 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >>>> On 9/26/23 1:01 AM, Jamal Hadi Salim wrote: > >>>>> On Fri, Sep 22, 2023 at 4:12 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >>>>>> On 9/20/23 1:20 AM, Jamal Hadi Salim wrote: > >>>>>>> On Tue, Sep 19, 2023 at 6:15 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > >>>>>>>> On 9/19/23 4:59 PM, Victor Nogueira wrote: > >>>> [...] > >>>>>> > >>>>>> In the above case we don't have 'internal' errors which you want to trace, so I would > >>>>>> also love to avoid the cost of zeroing struct tcf_result res which should be 3x 8b for > >>>>>> every packet. > >>>>> > >>>>> We can move the zeroing inside tc_run() but we declare it in the same > >>>>> spot as we do right now. You will still need to set res.verdict as > >>>>> above. > >>>>> Would that work for you? > >>>> > >>>> What I'm not following is that with the below you can avoid the unnecessary > >>>> fast path cost (which is only for corner case which is almost never hit) and > >>>> get even better visibility. Are you saying it doesn't work? > >>> > >>> I am probably missing something: > >>> -1/UNSPEC is a legit errno. And the main motivation here for this > >>> patch is to disambiguate if it was -EPERM vs UNSPEC > >>> Maybe that is what you are calling a "corner case"? > >> > >> Yes, but what is the use-case to ever return a -EPERM from the fast-path? This can > >> be audited for the code in the tree and therefore avoided so that you never run into > >> this problem. > > > > I am sorry but i am not in favor of this approach. > > You are suggesting audits are the way to go forward when in fact lack > > of said audits is what got us in this trouble with syzkaller to begin > > with. We cant rely on tribal knowledge to be able to spot these > > discrepancies. The elder of the tribe may move to a different mountain > > at some point and TheLinuxWay(tm) is cutnpaste, so i dont see this as > > long term good for maintainance. We have a clear distinction between > > an error vs verdict - lets use that. > > We really dont want to make this a special case just for eBPF and how > > to make it a happy world for eBPF at the cost of everyone else. I made > > a suggestion of leaving tcx alone, you can do your own thing there; > > but for tc_run my view is we should keep it generic. > > Jamal, before you come to early conclusions, it would be great if you also > read until the end of the email, because what I suggested below *is* generic > and with less churn throughout the code base. > I did look, Daniel. You are lumping all the error codes into one - which doesnt change my view on disambiguation. If i was to debug closely and run kprobe now i am seeing only one error code TC_ACT_ABORT instead of -EINVAL vs -ENOMEM, etc. Easier for me to find the source manually (and possibly even better with Andrii's tool i saw once if it would work in the datapath - iirc, i think it prints return codes on the code paths). cheers, jamal > >>> There are two options in my mind right now (since you are guaranteed > >>> in tcx_run you will never return anything below UNSPEC): > >>> 1) we just have the switch statement invocation inside an inline > >>> function and you can pass it sch_ret (for tcx case) and we'll pass it > >>> res.verdit for tc_run() case. > >>> 2) is something is we leave tcx_run alone and we have something along > >>> the lines of: > >>> > >>> -------------- > >>> diff --git a/net/core/dev.c b/net/core/dev.c > >>> index 1450f4741d9b..93613bce647c 100644 > >>> --- a/net/core/dev.c > >>> +++ b/net/core/dev.c > >>> @@ -3985,7 +3985,7 @@ sch_handle_ingress(struct sk_buff *skb, struct > >>> packet_type **pt_prev, int *ret, > >>> struct net_device *orig_dev, bool *another) > >>> { > >>> struct bpf_mprog_entry *entry = > >>> rcu_dereference_bh(skb->dev->tcx_ingress); > >>> - struct tcf_result res = {0}; > >>> + struct tcf_result res; > >>> int sch_ret; > >>> > >>> if (!entry) > >>> @@ -4003,14 +4003,16 @@ sch_handle_ingress(struct sk_buff *skb, struct > >>> packet_type **pt_prev, int *ret, > >>> if (sch_ret != TC_ACT_UNSPEC) > >>> goto ingress_verdict; > >>> } > >>> + > >>> + res.verdict = 0; > >>> sch_ret = tc_run(tcx_entry(entry), skb, &res); > >>> if (sch_ret < 0) { > >>> kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS_ERROR); > >>> *ret = NET_RX_DROP; > >>> return NULL; > >>> } > >>> + sch_ret = res.verdict; > >>> ingress_verdict: > >>> - switch (res.verdict) { > >>> + switch (sch_ret) { > >>> case TC_ACT_REDIRECT: > >>> /* skb_mac_header check was done by BPF, so we can > >>> safely > >>> * push the L2 header back before redirecting to another > >>> ----------- > >>> > >>> on the drop reason - our thinking is to support drop_watch alongside > >>> tracepoint given kfree_skb_reason exists already; if i am not mistaken > >>> what you suggested would require us to create a new tracepoint? > >> > >> So if the only thing you really care about is the different drop reason for > >> kfree_skb_reason, then I still don't follow why you need to drag this into > >> struct tcf_result. This can be done in a much simpler and more efficient way > >> like the following: > >> > >> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h > >> index a587e83fc169..b1c069c8e7f2 100644 > >> --- a/include/net/dropreason-core.h > >> +++ b/include/net/dropreason-core.h > >> @@ -80,6 +80,8 @@ > >> FN(IPV6_NDISC_BAD_OPTIONS) \ > >> FN(IPV6_NDISC_NS_OTHERHOST) \ > >> FN(QUEUE_PURGE) \ > >> + FN(TC_EGRESS_ERROR) \ > >> + FN(TC_INGRESS_ERROR) \ > >> FNe(MAX) > >> > >> /** > >> @@ -345,6 +347,10 @@ enum skb_drop_reason { > >> SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOST, > >> /** @SKB_DROP_REASON_QUEUE_PURGE: bulk free. */ > >> SKB_DROP_REASON_QUEUE_PURGE, > >> + /** @SKB_DROP_REASON_TC_EGRESS_ERROR: dropped in TC egress HOOK due to error */ > >> + SKB_DROP_REASON_TC_EGRESS_ERROR, > >> + /** @SKB_DROP_REASON_TC_INGRESS_ERROR: dropped in TC ingress HOOK due to error */ > >> + SKB_DROP_REASON_TC_INGRESS_ERROR, > >> /** > >> * @SKB_DROP_REASON_MAX: the maximum of core drop reasons, which > >> * shouldn't be used as a real 'reason' - only for tracing code gen > >> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h > >> index f308e8268651..cd2444dd3745 100644 > >> --- a/include/net/pkt_cls.h > >> +++ b/include/net/pkt_cls.h > >> @@ -10,6 +10,7 @@ > >> > >> /* TC action not accessible from user space */ > >> #define TC_ACT_CONSUMED (TC_ACT_VALUE_MAX + 1) > >> +#define TC_ACT_ABORT (TC_ACT_VALUE_MAX + 2) > >> > >> /* Basic packet classifier frontend definitions. */ > >> > >> diff --git a/net/core/dev.c b/net/core/dev.c > >> index 85df22f05c38..3abb4d71c170 100644 > >> --- a/net/core/dev.c > >> +++ b/net/core/dev.c > >> @@ -4011,7 +4011,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, > >> *ret = NET_RX_SUCCESS; > >> return NULL; > >> case TC_ACT_SHOT: > >> - kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS); > >> + case TC_ACT_ABORT: > >> + kfree_skb_reason(skb, likely(sch_ret == TC_ACT_SHOT) ? > >> + SKB_DROP_REASON_TC_INGRESS : > >> + SKB_DROP_REASON_TC_INGRESS_ERROR); > >> *ret = NET_RX_DROP; > >> return NULL; > >> /* used by tc_run */ > >> @@ -4054,7 +4057,10 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) > >> *ret = NET_XMIT_SUCCESS; > >> return NULL; > >> case TC_ACT_SHOT: > >> - kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS); > >> + case TC_ACT_ABORT: > >> + kfree_skb_reason(skb, likely(sch_ret == TC_ACT_SHOT) ? > >> + SKB_DROP_REASON_TC_EGRESS : > >> + SKB_DROP_REASON_TC_EGRESS_ERROR); > >> *ret = NET_XMIT_DROP; > >> return NULL; > >> /* used by tc_run */ > >> > >> Then you just return the internal TC_ACT_ABORT code for internal 'exceptions', > >> and you'll get the same result to make it observable for dropwatch. > >> > >> Thanks, > >> Daniel >