Yan Zhai wrote:
> On Fri, Jun 21, 2024 at 11:41 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> >
> > On 6/21/24 6:00 PM, Yan Zhai wrote:
> > > On Fri, Jun 21, 2024 at 8:13 AM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> > >> On 6/21/24 2:15 PM, Willem de Bruijn wrote:
> > >>> Yan Zhai wrote:
> > >>>> Software GRO is currently controlled by a single switch, i.e.
> > >>>>
> > >>>> ethtool -K dev gro on|off
> > >>>>
> > >>>> However, this is not always desired. When GRO is enabled, even if
> > >>>> the kernel cannot GRO certain traffic, it still has to run through
> > >>>> the GRO receive handlers with no benefit.
> > >>>>
> > >>>> There are also scenarios where turning off GRO is a requirement.
> > >>>> For example, our production environment has a scenario where a TC
> > >>>> egress hook may add multiple encapsulation headers to forwarded
> > >>>> skbs for load balancing and isolation purposes. The encapsulation
> > >>>> is implemented via BPF. But then a problem arises: there is no way
> > >>>> to properly offload a double-encapsulated packet, since the skb
> > >>>> only has network_header and inner_network_header to track one
> > >>>> layer of encapsulation, but not two. On the other hand, not all
> > >>>> the traffic through this device needs double encapsulation. But we
> > >>>> have to turn off GRO completely for any ingress device as a
> > >>>> result.
> > >>>>
> > >>>> Introduce a bit on the skb so that the GRO engine can be notified
> > >>>> to skip GRO on this skb, rather than a 0-or-1 setting for all
> > >>>> traffic.
> > >>>>
> > >>>> Signed-off-by: Yan Zhai <yan@xxxxxxxxxxxxxx>
> > >>>> ---
> > >>>>  include/linux/netdevice.h | 9 +++++++--
> > >>>>  include/linux/skbuff.h    | 10 ++++++++++
> > >>>>  net/Kconfig               | 10 ++++++++++
> > >>>>  net/core/gro.c            | 2 +-
> > >>>>  net/core/gro_cells.c      | 2 +-
> > >>>>  net/core/skbuff.c         | 4 ++++
> > >>>>  6 files changed, 33 insertions(+), 4 deletions(-)
> > >>>>
> > >>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > >>>> index c83b390191d4..2ca0870b1221 100644
> > >>>> --- a/include/linux/netdevice.h
> > >>>> +++ b/include/linux/netdevice.h
> > >>>> @@ -2415,11 +2415,16 @@ struct net_device {
> > >>>>  	((dev)->devlink_port = (port)); \
> > >>>>  })
> > >>>>
> > >>>> -static inline bool netif_elide_gro(const struct net_device *dev)
> > >>>> +static inline bool netif_elide_gro(const struct sk_buff *skb)
> > >>>>  {
> > >>>> -	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> > >>>> +	if (!(skb->dev->features & NETIF_F_GRO) || skb->dev->xdp_prog)
> > >>>>  		return true;
> > >>>> +
> > >>>> +#ifdef CONFIG_SKB_GRO_CONTROL
> > >>>> +	return skb->gro_disabled;
> > >>>> +#else
> > >>>>  	return false;
> > >>>> +#endif
> > >>>
> > >>> Yet more branches in the hot path.
> > >>>
> > >>> Compile-time configurability does not help, as that will be
> > >>> enabled by distros.
> > >>>
> > >>> For a fairly niche use case, where GRO functionality already
> > >>> works. So just a performance gain for a very rare case, at the
> > >>> cost of a regression in the common case. A small regression
> > >>> perhaps, but death by a thousand cuts.
> > >>
> > >> Mentioning it here b/c it perhaps fits in this context: a while ago
> > >> there was the idea of having BPF operate as the GRO engine, which
> > >> might also help reduce the attack surface by only having to handle
> > >> packets of interest for the concrete production use case. Perhaps
> > >> the metadata buffer could be used here to pass a notification from
> > >> XDP to exit early w/o aggregation.
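(As an aside, to make the metadata notification idea concrete: a minimal
XDP sketch that stamps a "skip GRO" hint into the metadata area might
look like the below. The xdp_meta layout and the skip_gro field are
hypothetical, invented for illustration; nothing in the current tree
consumes such a flag.)

  /* Hypothetical sketch: mark packets so that a (not yet existing)
   * GRO-side consumer could skip aggregation for them.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct xdp_meta {
  	__u32 skip_gro;	/* hypothetical "do not aggregate" hint */
  };

  SEC("xdp")
  int mark_no_gro(struct xdp_md *ctx)
  {
  	struct xdp_meta *meta;
  	void *data;

  	/* Reserve space in front of the packet for metadata. */
  	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
  		return XDP_PASS;

  	data = (void *)(long)ctx->data;
  	meta = (void *)(long)ctx->data_meta;
  	if ((void *)(meta + 1) > data)	/* satisfy the verifier */
  		return XDP_PASS;

  	meta->skip_gro = 1;
  	return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";

(The metadata region written at XDP is what later becomes visible to
tc BPF via data_meta, which is why the mismatch trick discussed below
has the effect it does.)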
> > >
> > > Metadata is in fact one of our interests as well. We discussed
> > > using metadata instead of an skb bit to carry this information
> > > internally. Since metadata is opaque atm, it seems the only option
> > > is to have a GRO control hook before napi_gro_receive, and let BPF
> > > decide between netif_receive_skb and napi_gro_receive (echoing what
> > > Paolo said). With BPF it could indeed be more flexible, but the
> > > downside is that it could be even slower than taking a bit on the
> > > skb. I am actually open to either approach, as long as it gives us
> > > more control over when to enable GRO :)
> >
> > Oh wait, one thing that just came to mind.. have you tried a u64
> > per-CPU counter map in XDP? For packets which should not be
> > GRO-aggregated you do count++ in the metadata area, and this forces
> > GRO to not aggregate, since the metadata that needs to be
> > transported to the tc BPF layer mismatches (and therefore the
> > contract/intent is that tc BPF needs to see the different metadata
> > passed to it).
> >
>
> We did this accidentally before (we put a timestamp for debugging
> purposes in the metadata), and it actually caused about 20%
> out-of-order (OoO) delivery for TCP in production: all PSH packets
> were reordered. When a metadata diff is found on a non-PSH packet,
> GRO does not fire the packet up to the upper layer; instead it is
> queued as a "new flow" on the GRO list and waits for flushing. When a
> PSH packet arrives, its semantics are to flush it immediately, so it
> precedes earlier packets of the same flow.

Is that a bug in XDP metadata handling for GRO? Mismatching metadata
should not be treated as a separate flow, but as a flush condition.
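For reference, the behavior Yan describes comes from the same-flow
check in gro_list_prepare(): a metadata difference only clears
same_flow, it never forces a flush. Heavily simplified from
net/core/gro.c, with the non-metadata comparisons elided:

  static void gro_list_prepare(const struct list_head *head,
  			     const struct sk_buff *skb)
  {
  	struct sk_buff *p;

  	list_for_each_entry(p, head, list) {
  		unsigned long diffs;

  		diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
  		/* Any byte difference in the metadata area poisons the
  		 * comparison... (header and vlan checks elided here) */
  		diffs |= skb_metadata_differs(p, skb);

  		/* ...but a nonzero diff does not flush p; the new skb is
  		 * merely enqueued as its own entry behind p, which is
  		 * what reorders PSH packets as described above. */
  		NAPI_GRO_CB(p)->same_flow = !diffs;
  	}
  }

So the flush-on-mismatch behavior suggested above would amount to
flushing p at the point where skb_metadata_differs() hits, rather than
only clearing same_flow.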