On Sat, 27 Feb 2021 00:36:02 +0100 Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote: > On 2/18/21 12:49 PM, Jesper Dangaard Brouer wrote: > > The FIB lookup example[1] show how the IP-header field tot_len > > (iph->tot_len) is used as input to perform the MTU check. > > > > This patch extend the BPF-helper bpf_check_mtu() with the same ability > > to provide the length as user parameter input, via mtu_len parameter. > > > > [1] samples/bpf/xdp_fwd_kern.c > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx> > > Acked-by: John Fastabend <john.fastabend@xxxxxxxxx> > > --- > > include/uapi/linux/bpf.h | 17 +++++++++++------ > > net/core/filter.c | 12 ++++++++++-- > > 2 files changed, 21 insertions(+), 8 deletions(-) > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 4c24daa43bac..4ba4ef0ff63a 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -3850,8 +3850,7 @@ union bpf_attr { > > * > > * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) > > * Description > > - > > - * Check ctx packet size against exceeding MTU of net device (based > > + * Check packet size against exceeding MTU of net device (based > > * on *ifindex*). This helper will likely be used in combination > > * with helpers that adjust/change the packet size. > > * > > @@ -3868,6 +3867,14 @@ union bpf_attr { > > * against the current net device. This is practical if this isn't > > * used prior to redirect. > > * > > + * On input *mtu_len* must be a valid pointer, else verifier will > > + * reject BPF program. If the value *mtu_len* is initialized to > > + * zero then the ctx packet size is use. When value *mtu_len* is > > + * provided as input this specify the L3 length that the MTU check > > + * is done against. Remember XDP and TC length operate at L2, but > > + * this value is L3 as this correlate to MTU and IP-header tot_len > > + * values which are L3 (similar behavior as bpf_fib_lookup). > > + * > > * The Linux kernel route table can configure MTUs on a more > > * specific per route level, which is not provided by this helper. > > * For route level MTU checks use the **bpf_fib_lookup**\ () > > @@ -3892,11 +3899,9 @@ union bpf_attr { > > * > > * On return *mtu_len* pointer contains the MTU value of the net > > * device. Remember the net device configured MTU is the L3 size, > > - * which is returned here and XDP and TX length operate at L2. > > + * which is returned here and XDP and TC length operate at L2. > > * Helper take this into account for you, but remember when using > > - * MTU value in your BPF-code. On input *mtu_len* must be a valid > > - * pointer and be initialized (to zero), else verifier will reject > > - * BPF program. > > + * MTU value in your BPF-code. > > * > > * Return > > * * 0 on success, and populate MTU value in *mtu_len* pointer. > > diff --git a/net/core/filter.c b/net/core/filter.c > > index 7059cf604d94..fcc3bda85960 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -5660,7 +5660,7 @@ BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) > > return -EINVAL; > > > > - if (unlikely(flags & BPF_MTU_CHK_SEGS && len_diff)) > > + if (unlikely(flags & BPF_MTU_CHK_SEGS && (len_diff || *mtu_len))) > > return -EINVAL; > > > > dev = __dev_via_ifindex(dev, ifindex); > > @@ -5670,7 +5670,11 @@ BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > mtu = READ_ONCE(dev->mtu); > > > > dev_len = mtu + dev->hard_header_len; > > - skb_len = skb->len + len_diff; /* minus result pass check */ > > + > > + /* If set use *mtu_len as input, L3 as iph->tot_len (like fib_lookup) */ > > + skb_len = *mtu_len ? *mtu_len + dev->hard_header_len : skb->len; > > + > > + skb_len += len_diff; /* minus result pass check */ > > if (skb_len <= dev_len) { > > ret = BPF_MTU_CHK_RET_SUCCESS; > > goto out; > > @@ -5715,6 +5719,10 @@ BPF_CALL_5(bpf_xdp_check_mtu, struct xdp_buff *, xdp, > > /* Add L2-header as dev MTU is L3 size */ > > dev_len = mtu + dev->hard_header_len; > > > > + /* Use *mtu_len as input, L3 as iph->tot_len (like fib_lookup) */ > > + if (*mtu_len) > > + xdp_len = *mtu_len + dev->hard_header_len; > > + > > xdp_len += len_diff; /* minus result pass check */ > > if (xdp_len > dev_len) > > ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > > > Btw, one more note on the whole bpf_*_check_mtu() helper... Last week I implemented > PMTU discovery support for clients for Cilium's XDP stand-alone LB in DSR mode, so I > was briefly considering whether to use the bpf_xdp_check_mtu() helper for retrieving > the device MTU, but then I thought to myself why having an unnecessary per-packet cost > for an extra helper call if I could just pass it in via constant instead. So I went > with the latter instead of the helper with the tradeoff to restart the Cilium agent > if someone actually changes MTU in prod which is a rare event anyway. > > Looking at what bpf_xdp_check_mtu() for example really offers is retrieval of dev->mtu > as well as dev->hard_header_len and the rest can all be done inside the BPF prog itself > w/o the helper overhead. Why am I mentioning this.. because the above change is a similar > case of what could have been done /inside/ the BPF prog anyway (especially on XDP where > extra overhead should be cut where possible). The XDP case looks super simple now, but I thinking ahead. When Lorenzo adds multi-buff support, then we can/must update this helper to use another XDP length value, based on the multi-buff jumbo-frame len. Maybe we need another helper or what you propose below. BUT we could also allow this helper (via flag?) to ALSO check if dev support multi-buff XDP transmit (besides MTU limit with multi-buff len). Then the BPF-programmer can know this packet cannot be redirected to the device. > I think it got lost somewhere in the many versions of the original set where it was > mentioned before, but allowing to retrieve the dev object into BPF context and then > exposing it similarly to how we handle the case of struct bpf_tcp_sock would have been > much cleaner approach, e.g. the prog from XDP and tc context would be able to do: > > struct bpf_dev *dev = ctx->dev; > > And we expose initially, for example: > > struct bpf_dev { > __u32 mtu; > __u32 hard_header_len; > __u32 ifindex; > __u32 rx_queues; > __u32 tx_queues; > }; > > And we could also have a BPF helper for XDP and tc that would fetch a /different/ dev > given we're under RCU context anyway, like ... > > BPF_CALL_2(bpf_get_dev, struct xdp_buff *, xdp, u32, ifindex) > { > return dev_get_by_index_rcu(dev_net(xdp->rxq->dev), index); > } > > ... returning a new dev_or_null type. With this flexibility everything else can be done > inside the prog, and later on it easily allows to expose more from dev side. Actually, > I'm inclined to code it up ... I love the idea to retrieve the dev object into BPF context. It is orthogonal, and doesn't replace the MTU helpers as the packet ctx objects (SKB and xdp_buff) are more complex, and the helper allows us to extend them without users have to update their BPF-code (as desc above). I do think it makes a lot of sense to expose/retrieve dev object into BPF context. As I hinted about, when we implement XDP multi-buff, then the bpf_redirect BPF-helper cannot check if the remote device support multi-buff transmit (as it don't have packet ctx). If we have the dev object, the we could expose XDP features that allow us (BPF-programmer) to check this prior to doing the redirect. To be clear: * I still think *this* patch is relevant and should be applied. I'm also on-board with retrieve the dev object into BPF context, as it have other use-cases. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer