Yan Zhai wrote:
> On Fri, Jun 21, 2024 at 11:41 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >
> > On 6/21/24 6:00 PM, Yan Zhai wrote:
> > > On Fri, Jun 21, 2024 at 8:13 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> > >> On 6/21/24 2:15 PM, Willem de Bruijn wrote:
> > >>> Yan Zhai wrote:
> > >>>> Software GRO is currently controlled by a single switch, i.e.
> > >>>>
> > >>>> ethtool -K dev gro on|off
> > >>>>
> > >>>> However, this is not always desired. When GRO is enabled, even if
> > >>>> the kernel cannot GRO certain traffic, it still has to run through
> > >>>> the GRO receive handlers with no benefit.
> > >>>>
> > >>>> There are also scenarios where turning off GRO is a requirement.
> > >>>> For example, our production environment has a scenario where a TC
> > >>>> egress hook may add multiple encapsulation headers to forwarded
> > >>>> skbs for load balancing and isolation purposes. The encapsulation
> > >>>> is implemented via BPF. But then a problem arises: there is no way
> > >>>> to properly offload a double-encapsulated packet, since the skb
> > >>>> only has network_header and inner_network_header to track one
> > >>>> layer of encapsulation, but not two. On the other hand, not all
> > >>>> the traffic through this device needs double encapsulation. But we
> > >>>> have to turn off GRO completely for any ingress device as a
> > >>>> result.
> > >>>>
> > >>>> Introduce a bit on the skb so that the GRO engine can be notified
> > >>>> to skip GRO on this skb, rather than a 0-or-1 setting for all
> > >>>> traffic.
> > >>>>
> > >>>> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> > >>>> ---
> > >>>>  include/linux/netdevice.h | 9 +++++++--
> > >>>>  include/linux/skbuff.h    | 10 ++++++++++
> > >>>>  net/Kconfig               | 10 ++++++++++
> > >>>>  net/core/gro.c            | 2 +-
> > >>>>  net/core/gro_cells.c      | 2 +-
> > >>>>  net/core/skbuff.c         | 4 ++++
> > >>>>  6 files changed, 33 insertions(+), 4 deletions(-)
> > >>>>
> > >>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > >>>> index c83b390191d4..2ca0870b1221 100644
> > >>>> --- a/include/linux/netdevice.h
> > >>>> +++ b/include/linux/netdevice.h
> > >>>> @@ -2415,11 +2415,16 @@ struct net_device {
> > >>>>  	((dev)->devlink_port = (port)); \
> > >>>>  })
> > >>>>
> > >>>> -static inline bool netif_elide_gro(const struct net_device *dev)
> > >>>> +static inline bool netif_elide_gro(const struct sk_buff *skb)
> > >>>>  {
> > >>>> -	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> > >>>> +	if (!(skb->dev->features & NETIF_F_GRO) || skb->dev->xdp_prog)
> > >>>>  		return true;
> > >>>> +
> > >>>> +#ifdef CONFIG_SKB_GRO_CONTROL
> > >>>> +	return skb->gro_disabled;
> > >>>> +#else
> > >>>>  	return false;
> > >>>> +#endif
> > >>>
> > >>> Yet more branches in the hot path.
> > >>>
> > >>> Compile-time configurability does not help, as that will be
> > >>> enabled by distros.
> > >>>
> > >>> For a fairly niche use case, where GRO functionality already
> > >>> works. So just a performance gain for a very rare case, at the
> > >>> cost of a regression in the common case. A small regression
> > >>> perhaps, but death by a thousand cuts.
> > >>
> > >> Mentioning it here b/c it perhaps fits in this context: a while ago
> > >> there was the idea of having BPF operate as the GRO engine, which
> > >> might also help reduce the attack surface by only having to handle
> > >> packets of interest for the concrete production use case. Perhaps
> > >> the metadata buffer could be used here to pass a notification from
> > >> XDP to exit early w/o aggregation.
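(As an aside, to make the metadata notification idea concrete: a minimal
XDP sketch that stamps a "skip GRO" hint into the metadata area might
look like the below. The xdp_meta layout and the skip_gro field are
hypothetical, invented for illustration; nothing in the current tree
consumes such a flag.)

  /* Hypothetical sketch: mark packets so that a (not yet existing)
   * GRO-side consumer could skip aggregation for them.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct xdp_meta {
  	__u32 skip_gro;	/* hypothetical "do not aggregate" hint */
  };

  SEC("xdp")
  int mark_no_gro(struct xdp_md *ctx)
  {
  	struct xdp_meta *meta;
  	void *data;

  	/* Reserve space in front of the packet for metadata. */
  	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
  		return XDP_PASS;

  	data = (void *)(long)ctx->data;
  	meta = (void *)(long)ctx->data_meta;
  	if ((void *)(meta + 1) > data)	/* satisfy the verifier */
  		return XDP_PASS;

  	meta->skip_gro = 1;
  	return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";

(The metadata region written at XDP is what later becomes visible to
tc BPF via data_meta, which is why the mismatch trick discussed below
has the effect it does.)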
> > >
> > > Metadata is in fact one of our interests as well. We discussed
> > > using metadata instead of an skb bit to carry this information
> > > internally. Since metadata is opaque atm, it seems the only option
> > > is to have a GRO control hook before napi_gro_receive, and let BPF
> > > decide between netif_receive_skb and napi_gro_receive (echoing what
> > > Paolo said). With BPF it could indeed be more flexible, but the
> > > downside is that it could be even slower than taking a bit on the
> > > skb. I am actually open to either approach, as long as it gives us
> > > more control over when to enable GRO :)
> >
> > Oh wait, one thing that just came to mind.. have you tried a u64
> > per-CPU counter map in XDP? For packets which should not be
> > GRO-aggregated you do count++ in the metadata area, and this forces
> > GRO to not aggregate, since the metadata that needs to be
> > transported to the tc BPF layer mismatches (and therefore the
> > contract/intent is that tc BPF needs to see the different metadata
> > passed to it).
> >
>
> We did this accidentally before (we put a timestamp for debugging
> purposes in the metadata), and it actually caused about 20%
> out-of-order (OoO) delivery for TCP in production: all PSH packets
> were reordered. When a metadata diff is found on a non-PSH packet,
> GRO does not fire the packet up to the upper layer; instead it is
> queued as a "new flow" on the GRO list and waits for flushing. When a
> PSH packet arrives, its semantics are to flush it immediately, so it
> precedes earlier packets of the same flow.

Is that a bug in XDP metadata handling for GRO? Mismatching metadata
should not be treated as a separate flow, but as a flush condition.
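For reference, the behavior Yan describes comes from the same-flow
check in gro_list_prepare(): a metadata difference only clears
same_flow, it never forces a flush. Heavily simplified from
net/core/gro.c, with the non-metadata comparisons elided:

  static void gro_list_prepare(const struct list_head *head,
  			     const struct sk_buff *skb)
  {
  	struct sk_buff *p;

  	list_for_each_entry(p, head, list) {
  		unsigned long diffs;

  		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
  		/* Any byte difference in the metadata area poisons the
  		 * comparison... (header and vlan checks elided here) */
  		diffs |= skb_metadata_differs(p, skb);

  		/* ...but a nonzero diff does not flush p; the new skb is
  		 * merely enqueued as its own entry behind p, which is
  		 * what reorders PSH packets as described above. */
  		NAPI_GRO_CB(p)->same_flow = !diffs;
  	}
  }

So the flush-on-mismatch behavior suggested above would amount to
flushing p at the point where skb_metadata_differs() hits, rather than
only clearing same_flow.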