Re: BPF redirect API design issue for BPF-prog MTU feedback?

Jesper Dangaard Brouer <brouer@xxxxxxxxxx> · Fri, 18 Sep 2020 12:00:16 +0200

On Thu, 17 Sep 2020 12:11:33 -0700
Saeed Mahameed <saeed@xxxxxxxxxx> wrote:

> On Thu, 2020-09-17 at 05:54 -0700, Maciej Żenczykowski wrote:
> > On Thu, Sep 17, 2020 at 5:39 AM Jesper Dangaard Brouer
> > <brouer@xxxxxxxxxx> wrote:  
> > > 
> > > As you likely know[1] I'm looking into moving the MTU check (for
> > > TC-BPF) in __bpf_skb_max_len() when e.g. called by
> > > bpf_skb_adjust_room(), because when redirecting packets to
> > > another netdev it is not correct to limit the MTU based on the
> > > incoming netdev.
> > > 
> > > I was looking at doing the MTU check in bpf_redirect() helper,
> > > because at this point we know the redirect to netdev, and
> > > returning an indication/error that MTU was exceed, would allow
> > > the BPF-prog logic to react, e.g. sending ICMP (instead of packet
> > > getting silently dropped). 
> > > BUT this is not possible because bpf_redirect(index, flags) helper
> > > don't provide the packet context-object (so I cannot lookup the
> > > packet length).
> > > 
> > > Seeking input:
> > > 
> > > Should/can we change the bpf_redirect API or create a new helper
> > > with packet-context?
> > > 
> > >  Note: We have the same need for the packet context for XDP when
> > >  redirecting the new multi-buffer packets, as not all destination
> > >  netdev will support these new multi-buffer packets.
> > > 
> > > I can of-cause do the MTU checks on kernel-side in
> > > skb_do_redirect, but then how do people debug this? as packet
> > > will basically be silently dropped.
> > > 
> > > 
> > > 
> > > (Looking at how does BPF-prog logic handle MTU today)
> > > 
> > > How do bpf_skb_adjust_room() report that the MTU was exceeded?
> > > Unfortunately it uses a common return code -ENOTSUPP which used
> > > for multiple cases (include MTU exceeded). Thus, the BPF-prog
> > > logic cannot use this reliably to know if this is a MTU exceeded
> > > event. (Looked BPF-prog code and they all simply exit with
> > > TC_ACT_SHOT for all error codes, cloudflare have the most
> > > advanced handling with metrics->errors_total_encap_adjust_failed++).
> > > 
> > > 
> > > [1] 
> > > https://lore.kernel.org/bpf/159921182827.1260200.9699352760916903781.stgit@firesoul/
> > > --
> > > Best regards,
> > >   Jesper Dangaard Brouer
> > >   MSc.CS, Principal Kernel Engineer at Red Hat
> > >   LinkedIn: http://www.linkedin.com/in/brouer
> > >   
> > 
> > (a) the current state of the world seems very hard to use correctly,
> > so adding new apis, or even changing existing ones seems ok to me.
> > especially if this just means changing what error code they return
> > 
> > (b) another complexity with bpf_redirect() is you can call it, it
> > can succeed, but then you can not return TC_ACT_REDIRECT from the
> > bpf program, which effectively makes the earlier *successful*
> > bpf_redirect() call an utter no-op.
> > 
> > (bpf_redirect() just determines what a future return TC_ACT_REDIRECT
> > will do)
> > 
> > so if you bpf_redirect to interface with larger mtu, then increase
> > packet size,  
> 
> why would you redirect then touch the packet afterwards ? 
> if you have a bad program, then it is a user issue.
> 
> > then return TC_ACT_OK, then you potentially end up with excessively
> > large packet egressing through original interface (with small mtu).
> > 

This is a good point.  As bpf_skb_adjust_room() can just be run after
bpf_redirect() call, then a MTU check in bpf_redirect() actually
doesn't make much sense.  As clever/bad BPF program can then avoid the
MTU check anyhow.  This basically means that we have to do the MTU
check (again) on kernel side anyhow to catch such clever/bad BPF
programs.  (And I don't like wasting cycles on doing the same check two
times).

If we do the MTU check on the kernel side, then there are no feedback
to the program, and how are end-users going to debug this?

> > My vote would be to return a new distinct error from bpf_redirect()
> > based on then current packet size and interface being redirected
> > to, save this interface mtu  somewhere, then in operations that
> > increase packet size check against this saved mtu, for correctness
> > you still have to check mtu after the bpf program is done,
> > but this is then just to deal with braindead bpf code (that calls
> > bpf_redirect and returns TC_ACT_OK, or calls bpf_redirect() multiple
> > times, or something...).
> >   

Hmm, I don't like wasting cycles on doing the same check multiple times.

In bpf_redirect() we store the netdev we want to redirect to, thus, if
bpf_skb_adjust_room() is run after bpf_redirect(), we could also do a
MTU check in bpf_skb_adjust_room() based on this netdev. (maybe there
are still ways for a BPF-prog to cheat the MTU check?).

> 
> Another solution is to have an exception function defined in the
> BPF_prog, this function by itself is another program that can be
> executed to notify the prog about any exception/err that happened
> after the main BPF_program exited and let the XDP program react by
> its own logic.

We are talking about TC-BPF programs here, but the concept and
usability issue with bpf_redirect() is the same for XDP.

If doing the MTU check (and drop) on kernel side, as was thinking about
adding a simple tracepoint, for end-users to debug this. It would also
allow for attaching a BPF-prog to the tracepoint to get more direct
feedback, but it would not allow sending back an ICMP response.

Your approach of calling a 2nd BPF-prog to deal with exceptions, and
allow it to alter the packet and flow "action" is definitely
interesting.  What do others think?

> example:
> 
> BPF_prog:
>     int XDP_main_prog(xdp_buff) {
>         xdp_adjust_head/tail(xdp_buff);
>         return xdp_redirect(ifindex, flags);
>     }
> 
>     int XDP_exception(xdp_buff, excption_code) {
>         if (excetption_code == XDP_REDIRECRT_MTU_EXCEEDED) {
>                 ICMP_response(xdp_buff);
>                 return XDP_TX;
>         }
>         return XDP_DROP;
>     }
> 
> 
> netdev_driver_xdp_handle():
>    act = bpf_prog_run_xdp(prog, xdp); // Run XDP_main_prog
>    if (act == XDP_REDIRECT)
>        err = xdp_do_redirect(netdev, xdp, prog);
>        if (err) { 
>           // Run XDP_exception() function in the user prog
>           // finds the exception handler of active program
>           act = bpf_prog_run_xdp_exciption(prog, xdp, err);
>           // then handle exception action in the driver
> (XDP_TX/DROP/FORWARD).. 
>        }
> 
> of-course a user program will be notified only on the first err .. 
> if it fails on the 2nd time .. just drop..

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer