On Fri, Jun 21, 2024 at 8:13 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
>
> On 6/21/24 2:15 PM, Willem de Bruijn wrote:
> > Yan Zhai wrote:
> >> Software GRO is currently controlled by a single switch, i.e.
> >>
> >>   ethtool -K dev gro on|off
> >>
> >> However, this is not always desired. When GRO is enabled, even if the
> >> kernel cannot GRO certain traffic, it has to run through the GRO receive
> >> handlers with no benefit.
> >>
> >> There are also scenarios that turning off GRO is a requirement. For
> >> example, our production environment has a scenario that a TC egress hook
> >> may add multiple encapsulation headers to forwarded skbs for load
> >> balancing and isolation purpose. The encapsulation is implemented via
> >> BPF. But the problem arises then: there is no way to properly offload a
> >> double-encapsulated packet, since skb only has network_header and
> >> inner_network_header to track one layer of encapsulation, but not two.
> >> On the other hand, not all the traffic through this device needs double
> >> encapsulation. But we have to turn off GRO completely for any ingress
> >> device as a result.
> >>
> >> Introduce a bit on skb so that GRO engine can be notified to skip GRO on
> >> this skb, rather than having to be 0-or-1 for all traffic.
> >>
> >> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> >> ---
> >>  include/linux/netdevice.h |  9 +++++++--
> >>  include/linux/skbuff.h    | 10 ++++++++++
> >>  net/Kconfig               | 10 ++++++++++
> >>  net/core/gro.c            |  2 +-
> >>  net/core/gro_cells.c      |  2 +-
> >>  net/core/skbuff.c         |  4 ++++
> >>  6 files changed, 33 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index c83b390191d4..2ca0870b1221 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -2415,11 +2415,16 @@ struct net_device {
> >>          ((dev)->devlink_port = (port)); \
> >>  })
> >>
> >> -static inline bool netif_elide_gro(const struct net_device *dev)
> >> +static inline bool netif_elide_gro(const struct sk_buff *skb)
> >>  {
> >> -        if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> >> +        if (!(skb->dev->features & NETIF_F_GRO) || skb->dev->xdp_prog)
> >>                  return true;
> >> +
> >> +#ifdef CONFIG_SKB_GRO_CONTROL
> >> +        return skb->gro_disabled;
> >> +#else
> >>          return false;
> >> +#endif
> >
> > Yet more branches in the hot path.
> >
> > Compile time configurability does not help, as that will be
> > enabled by distros.
> >
> > For a fairly niche use case. Where functionality of GRO already
> > works. So just a performance for a very rare case at the cost of a
> > regression in the common case. A small regression perhaps, but death
> > by a thousand cuts.
>
> Mentioning it here b/c it perhaps fits in this context, longer time ago
> there was the idea mentioned to have BPF operating as GRO engine which
> might also help to reduce attack surface by only having to handle packets
> of interest for the concrete production use case. Perhaps here meta data
> buffer could be used to pass a notification from XDP to exit early w/o
> aggregation.

Metadata is in fact one of our interests as well. We discussed using
metadata instead of a skb bit to carry this information internally.
Since metadata is opaque at the moment, it seems the only option is to
have a GRO control hook before napi_gro_receive and let BPF decide
between netif_receive_skb and napi_gro_receive (echoing what Paolo
said). With BPF it could indeed be more flexible, but the downside is
that it could be even slower than taking a bit on the skb. I am actually
open to either approach, as long as it gives us more control over when
to enable GRO :)

To extend the discussion a bit, putting GRO aside, I think a common hook
before GRO would still be valuable moving forward: it is a limited
window where the driver code has access to both the XDP context and the
skb. Today we do not have a good way to transfer HW offload info to skbs
when XDP redirects to a CPU, or when XDP encapsulates and transmits for
load balancing purposes. The XDP metadata infrastructure already allows
XDP to read this information with driver support, so to complete that, a
place to use it (which I introduced as xdp_buff/frame_fixup_skb_offloading
in a later patch) would be beneficial for passing on things like the flow
hash, vlan information, etc.

best
Yan
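
To make the choice discussed above concrete, here is a minimal userspace
sketch in C of the per-skb decision. It is not kernel code: the model_*
structs, MODEL_NETIF_F_GRO and model_pick_rx_path() are hypothetical
stand-ins that only borrow field names from the quoted diff, and mapping
the "skip GRO" outcome to netif_receive_skb follows the wording used in
this thread, not necessarily the exact kernel receive path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only. These are NOT the real struct net_device or
 * struct sk_buff; they copy a few field names for illustration. */
#define MODEL_NETIF_F_GRO (1u << 0)     /* stand-in for NETIF_F_GRO */

struct model_net_device {
	unsigned int features;          /* feature bits, e.g. MODEL_NETIF_F_GRO */
	const void *xdp_prog;           /* non-NULL when an XDP program is attached */
};

struct model_sk_buff {
	const struct model_net_device *dev;
	bool gro_disabled;              /* the per-skb bit proposed in the patch */
};

enum model_rx_path {
	MODEL_RX_NETIF_RECEIVE_SKB,     /* skip GRO for this skb */
	MODEL_RX_NAPI_GRO_RECEIVE,      /* let GRO try to aggregate */
};

/* Same checks as the patched netif_elide_gro() in the quoted diff,
 * assuming CONFIG_SKB_GRO_CONTROL is enabled. */
static inline bool model_elide_gro(const struct model_sk_buff *skb)
{
	assert(skb != NULL && skb->dev != NULL);

	if (!(skb->dev->features & MODEL_NETIF_F_GRO) || skb->dev->xdp_prog)
		return true;

	return skb->gro_disabled;
}

/* Hypothetical pre-GRO dispatch: pick the receive path per skb instead
 * of per device. */
static inline enum model_rx_path
model_pick_rx_path(const struct model_sk_buff *skb)
{
	return model_elide_gro(skb) ? MODEL_RX_NETIF_RECEIVE_SKB
				    : MODEL_RX_NAPI_GRO_RECEIVE;
}
```

With the device-level GRO feature set and no XDP program attached, only
skbs that have gro_disabled set take the non-GRO path in this model;
clearing the device feature sends every skb down that path, which is the
all-or-nothing behavior the patch is trying to avoid.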