Re: [PATCH net-next 1/1] net/sched: Disambiguate verdict from return code

Jamal Hadi Salim <jhs@xxxxxxxxxxxx> · Tue, 3 Oct 2023 17:36:33 -0400

On Tue, Oct 3, 2023 at 9:49 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 10/3/23 2:46 PM, Jamal Hadi Salim wrote:
> > On Tue, Oct 3, 2023 at 5:00 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >> On 10/2/23 9:54 PM, Jamal Hadi Salim wrote:
> >>> On Fri, Sep 29, 2023 at 11:48 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>> On 9/26/23 1:01 AM, Jamal Hadi Salim wrote:
> >>>>> On Fri, Sep 22, 2023 at 4:12 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>>>> On 9/20/23 1:20 AM, Jamal Hadi Salim wrote:
> >>>>>>> On Tue, Sep 19, 2023 at 6:15 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >>>>>>>> On 9/19/23 4:59 PM, Victor Nogueira wrote:
> >>>> [...]
> >>>>>>
> >>>>>> In the above case we don't have 'internal' errors which you want to trace, so I would
> >>>>>> also love to avoid the cost of zeroing struct tcf_result res which should be 3x 8b for
> >>>>>> every packet.
> >>>>>
> >>>>> We can move the zeroing inside tc_run() but we declare it in the same
> >>>>> spot as we do right now. You will still need to set res.verdict as
> >>>>> above.
> >>>>> Would that work for you?
> >>>>
> >>>> What I'm not following is that with the below you can avoid the unnecessary
> >>>> fast path cost (which is only for corner case which is almost never hit) and
> >>>> get even better visibility. Are you saying it doesn't work?
> >>>
> >>> I am probably missing something:
> >>> -1/UNSPEC is a legit errno. And the main motivation here for this
> >>> patch is to disambiguate if it was -EPERM vs UNSPEC
> >>> Maybe that is what you are calling a "corner case"?
> >>
> >> Yes, but what is the use-case to ever return a -EPERM from the fast-path? This can
> >> be audited for the code in the tree and therefore avoided so that you never run into
> >> this problem.
> >
> > I am sorry but i am not in favor of this approach.
> > You are suggesting audits are the way to go forward when in fact lack
> > of said audits is what got us in this trouble with syzkaller to begin
> > with. We cant rely on tribal knowledge to be able to spot these
> > discrepancies. The elder of the tribe may move to a different mountain
> > at some point and TheLinuxWay(tm) is cutnpaste, so i dont see this as
> > long term good for maintainance. We have a clear distinction between
> > an error vs verdict - lets use that.
> > We really dont want to make this a special case just for eBPF and how
> > to make it a happy world for eBPF at the cost of everyone else. I made
> > a suggestion of leaving tcx alone, you can do your own thing there;
> > but for tc_run my view is we should keep it generic.
>
> Jamal, before you come to early conclusions, it would be great if you also
> read until the end of the email, because what I suggested below *is* generic
> and with less churn throughout the code base.
>

I did look, Daniel. You are lumping all the error codes into one -
which doesnt change my view on disambiguation. If i was to debug
closely and run kprobe now i am seeing only one error code
TC_ACT_ABORT instead of -EINVAL vs -ENOMEM, etc. Easier for me to find
the source manually (and possibly even better with Andrii's tool i saw
once if it would work in the datapath - iirc, i think it prints return
codes on the code paths).

cheers,
jamal

> >>> There are two options in my mind right now (since you are guaranteed
> >>> in tcx_run you will never return anything below UNSPEC):
> >>> 1) we just have the switch statement invocation inside an inline
> >>> function and you can pass it sch_ret (for tcx case) and we'll pass it
> >>> res.verdit for tc_run() case.
> >>> 2) is something is we leave tcx_run alone and we have something along
> >>> the lines of:
> >>>
> >>> --------------
> >>> diff --git a/net/core/dev.c b/net/core/dev.c
> >>> index 1450f4741d9b..93613bce647c 100644
> >>> --- a/net/core/dev.c
> >>> +++ b/net/core/dev.c
> >>> @@ -3985,7 +3985,7 @@ sch_handle_ingress(struct sk_buff *skb, struct
> >>> packet_type **pt_prev, int *ret,
> >>>                      struct net_device *orig_dev, bool *another)
> >>>    {
> >>>           struct bpf_mprog_entry *entry =
> >>> rcu_dereference_bh(skb->dev->tcx_ingress);
> >>> -       struct tcf_result res = {0};
> >>> +       struct tcf_result res;
> >>>           int sch_ret;
> >>>
> >>>           if (!entry)
> >>> @@ -4003,14 +4003,16 @@ sch_handle_ingress(struct sk_buff *skb, struct
> >>> packet_type **pt_prev, int *ret,
> >>>                   if (sch_ret != TC_ACT_UNSPEC)
> >>>                           goto ingress_verdict;
> >>>           }
> >>> +
> >>> +       res.verdict = 0;
> >>>           sch_ret = tc_run(tcx_entry(entry), skb, &res);
> >>>           if (sch_ret < 0) {
> >>>                   kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS_ERROR);
> >>>                   *ret = NET_RX_DROP;
> >>>                   return NULL;
> >>>           }
> >>> +       sch_ret = res.verdict;
> >>>    ingress_verdict:
> >>> -       switch (res.verdict) {
> >>> +       switch (sch_ret) {
> >>>           case TC_ACT_REDIRECT:
> >>>                   /* skb_mac_header check was done by BPF, so we can
> >>> safely
> >>>                    * push the L2 header back before redirecting to another
> >>> -----------
> >>>
> >>> on the drop reason - our thinking is to support drop_watch alongside
> >>> tracepoint given kfree_skb_reason exists already; if i am not mistaken
> >>> what you suggested would require us to create a new tracepoint?
> >>
> >> So if the only thing you really care about is the different drop reason for
> >> kfree_skb_reason, then I still don't follow why you need to drag this into
> >> struct tcf_result. This can be done in a much simpler and more efficient way
> >> like the following:
> >>
> >> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> >> index a587e83fc169..b1c069c8e7f2 100644
> >> --- a/include/net/dropreason-core.h
> >> +++ b/include/net/dropreason-core.h
> >> @@ -80,6 +80,8 @@
> >>          FN(IPV6_NDISC_BAD_OPTIONS)      \
> >>          FN(IPV6_NDISC_NS_OTHERHOST)     \
> >>          FN(QUEUE_PURGE)                 \
> >> +       FN(TC_EGRESS_ERROR)             \
> >> +       FN(TC_INGRESS_ERROR)            \
> >>          FNe(MAX)
> >>
> >>    /**
> >> @@ -345,6 +347,10 @@ enum skb_drop_reason {
> >>          SKB_DROP_REASON_IPV6_NDISC_NS_OTHERHOST,
> >>          /** @SKB_DROP_REASON_QUEUE_PURGE: bulk free. */
> >>          SKB_DROP_REASON_QUEUE_PURGE,
> >> +       /** @SKB_DROP_REASON_TC_EGRESS_ERROR: dropped in TC egress HOOK due to error */
> >> +       SKB_DROP_REASON_TC_EGRESS_ERROR,
> >> +       /** @SKB_DROP_REASON_TC_INGRESS_ERROR: dropped in TC ingress HOOK due to error */
> >> +       SKB_DROP_REASON_TC_INGRESS_ERROR,
> >>          /**
> >>           * @SKB_DROP_REASON_MAX: the maximum of core drop reasons, which
> >>           * shouldn't be used as a real 'reason' - only for tracing code gen
> >> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> >> index f308e8268651..cd2444dd3745 100644
> >> --- a/include/net/pkt_cls.h
> >> +++ b/include/net/pkt_cls.h
> >> @@ -10,6 +10,7 @@
> >>
> >>    /* TC action not accessible from user space */
> >>    #define TC_ACT_CONSUMED               (TC_ACT_VALUE_MAX + 1)
> >> +#define TC_ACT_ABORT           (TC_ACT_VALUE_MAX + 2)
> >>
> >>    /* Basic packet classifier frontend definitions. */
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 85df22f05c38..3abb4d71c170 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4011,7 +4011,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> >>                  *ret = NET_RX_SUCCESS;
> >>                  return NULL;
> >>          case TC_ACT_SHOT:
> >> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS);
> >> +       case TC_ACT_ABORT:
> >> +               kfree_skb_reason(skb, likely(sch_ret == TC_ACT_SHOT) ?
> >> +                                SKB_DROP_REASON_TC_INGRESS :
> >> +                                SKB_DROP_REASON_TC_INGRESS_ERROR);
> >>                  *ret = NET_RX_DROP;
> >>                  return NULL;
> >>          /* used by tc_run */
> >> @@ -4054,7 +4057,10 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
> >>                  *ret = NET_XMIT_SUCCESS;
> >>                  return NULL;
> >>          case TC_ACT_SHOT:
> >> -               kfree_skb_reason(skb, SKB_DROP_REASON_TC_EGRESS);
> >> +       case TC_ACT_ABORT:
> >> +               kfree_skb_reason(skb, likely(sch_ret == TC_ACT_SHOT) ?
> >> +                                SKB_DROP_REASON_TC_EGRESS :
> >> +                                SKB_DROP_REASON_TC_EGRESS_ERROR);
> >>                  *ret = NET_XMIT_DROP;
> >>                  return NULL;
> >>          /* used by tc_run */
> >>
> >> Then you just return the internal TC_ACT_ABORT code for internal 'exceptions',
> >> and you'll get the same result to make it observable for dropwatch.
> >>
> >> Thanks,
> >> Daniel
>