The MTU should only apply to transmitted packets. When TC-ingress redirect packet to egress on another netdev, then the normal netstack MTU checks are skipped (and driver level will not catch any MTU violation, checked ixgbe). This patch choose not to add MTU check in the egress code path of skb_do_redirect() prior to calling dev_queue_xmit(), because it is still possible to run another BPF egress program that will shrink/consume headers, which will make packet comply with netdev MTU. This use-case might already be in production use (if ingress MTU is larger than egress). Instead do the MTU check after sch_handle_egress() step, for the cases that require this. The cases need a bit explaining. Ingress to egress redirected packets could be detected via skb->tc_at_ingress bit, but it is not reliable, because sch_handle_egress() could steal the packet and redirect this (again) to another egress netdev, which will then have the skb->tc_at_ingress cleared. There is also the case of TC-egress prog increase packet size and then redirect it egress. Thus, it is more reliable to do the MTU check for any redirected packet (both ingress and egress), which is available via skb_is_redirected() in earlier patch. Also handle case where egress BPF-prog increased size. One advantage of this approach is that it ingress-to-egress BPF-prog can send information via packet data. With the MTU checks removed in the helpers, and also not done in skb_do_redirect() call, this allows for an ingress BPF-prog to communicate with an egress BPF-prog via packet data, as long as egress BPF-prog remove this prior to transmitting packet. Troubleshooting: MTU violations are recorded in TX dropped counter, and kprobe on dev_queue_xmit() have retval -EMSGSIZE. Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx> --- net/core/dev.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index b433098896b2..33529022b30d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3870,6 +3870,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) { case TC_ACT_OK: case TC_ACT_RECLASSIFY: + *ret = NET_XMIT_SUCCESS; skb->tc_index = TC_H_MIN(cl_res.classid); break; case TC_ACT_SHOT: @@ -4064,9 +4065,10 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) { struct net_device *dev = skb->dev; struct netdev_queue *txq; + bool mtu_check = false; + bool again = false; struct Qdisc *q; int rc = -ENOMEM; - bool again = false; skb_reset_mac_header(skb); @@ -4082,14 +4084,28 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) qdisc_pkt_len_init(skb); #ifdef CONFIG_NET_CLS_ACT + mtu_check = skb_is_redirected(skb); skb->tc_at_ingress = 0; # ifdef CONFIG_NET_EGRESS if (static_branch_unlikely(&egress_needed_key)) { + unsigned int len_orig = skb->len; + skb = sch_handle_egress(skb, &rc, dev); if (!skb) goto out; + /* BPF-prog ran and could have changed packet size beyond MTU */ + if (rc == NET_XMIT_SUCCESS && skb->len > len_orig) + mtu_check = true; } # endif + /* MTU-check only happens on "last" net_device in a redirect sequence + * (e.g. above sch_handle_egress can steal SKB and skb_do_redirect it + * either ingress or egress to another device). + */ + if (mtu_check && !is_skb_forwardable(dev, skb)) { + rc = -EMSGSIZE; + goto drop; + } #endif /* If device/qdisc don't need skb->dst, release it right now while * its hot in this cpu cache. @@ -4157,7 +4173,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev) rc = -ENETDOWN; rcu_read_unlock_bh(); - +drop: atomic_long_inc(&dev->tx_dropped); kfree_skb_list(skb); return rc;