Richard Gobert wrote:
> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> iph->id, ...) against all packets in a loop. These flush checks are used in
> all merging UDP and TCP flows.
>
> These checks need to be done only once and only against the found p skb,
> since they only affect flush and not same_flow.
>
> This patch leverages correct network header offsets from the cb for both
> outer and inner network headers - allowing these checks to be done only
> once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
> NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
> more declarative and contained in inet_gro_flush, thus removing the need
> for flush_id in napi_gro_cb.
>
> This results in less parsing code for non-loop flush tests for TCP and UDP
> flows.
>
> To make sure results are not within noise range - I've made netfilter drop
> all TCP packets, and measured CPU performance in GRO (in this case GRO is
> responsible for about 50% of the CPU utilization).
>
> perf top while replaying 64 parallel IP/TCP streams merging in GRO:
> (gro_network_flush is compiled inline to tcp_gro_receive)
> net-next:
>         6.94% [kernel] [k] inet_gro_receive
>         3.02% [kernel] [k] tcp_gro_receive
>
> patch applied:
>         4.27% [kernel] [k] tcp_gro_receive
>         4.22% [kernel] [k] inet_gro_receive
>
> perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
> results for any encapsulation, in this case inet_gro_receive is top
> offender in net-next)
> net-next:
>         10.09% [kernel] [k] inet_gro_receive
>         2.08% [kernel] [k] tcp_gro_receive
>
> patch applied:
>         6.97% [kernel] [k] inet_gro_receive
>         3.68% [kernel] [k] tcp_gro_receive

Thanks for getting the additional numbers. The savings are not huge.

But +1 on the change also because it simplifies this non-obvious logic.
It makes sense to separate flow matching and flush logic.

Btw please include Alexander Duyck in the Cc: of this series.

> +static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
> +				 struct sk_buff *p, bool outer)
> +{
> +	const u32 id = ntohl(*(__be32 *)&iph->id);
> +	const u32 id2 = ntohl(*(__be32 *)&iph2->id);
> +	const u16 flush_id = (id >> 16) - (id2 >> 16);
> +	const u16 count = NAPI_GRO_CB(p)->count;
> +	const u32 df = id & IP_DF;
> +	u32 is_atomic;
> +	int flush;
> +
> +	/* All fields must match except length and checksum. */
> +	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
> +
> +	if (outer && df)
> +		return flush;

Does the fixed id logic apply equally to inner and outer IPv4?

> +
> +	/* When we receive our second frame we can make a decision on if we
> +	 * continue this flow as an atomic flow with a fixed ID or if we use
> +	 * an incrementing ID.
> +	 */
> +	NAPI_GRO_CB(p)->is_atomic |= (count == 1 && df && flush_id == 0);
> +	is_atomic = (df && NAPI_GRO_CB(p)->is_atomic) - 1;
> +
> +	return flush | (flush_id ^ (count & is_atomic));

This is a good time to consider making this logic more obvious.

First off, the flush check can be part of the outer && df above, as
flush is not modified after.

Subjective, but I find the following more readable, and not worth
saving a few branches.

	if (count == 1 && df && !flush_id)
		NAPI_GRO_CB(p)->is_atomic = true;

	ip_fixedid_matches = NAPI_GRO_CB(p)->is_atomic ^ df;
	ipid_offset_matches = ipid_offset - count;

	return ip_fixedid_matches & ipid_offset_matches;

Have to be a bit careful about types. Have not checked that in detail.
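For concreteness, a rough and untested sketch of how the whole helper
might then read - not the code posted in this series, and using the
ipid_offset naming from the nitpick below. The type care amounts to
reducing df to 0/1 before the xor and combining the two checks with &&
rather than &, since the two values do not share bits:

static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
				 struct sk_buff *p, bool outer)
{
	const u32 id = ntohl(*(__be32 *)&iph->id);
	const u32 id2 = ntohl(*(__be32 *)&iph2->id);
	const u16 ipid_offset = (id >> 16) - (id2 >> 16);
	const u16 count = NAPI_GRO_CB(p)->count;
	const u32 df = id & IP_DF;
	u16 ip_fixedid_matches;
	u16 ipid_offset_matches;
	int flush;

	/* All fields must match except length and checksum. */
	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) |
		(df ^ (id2 & IP_DF));

	/* flush is not modified below, so it can be folded into the
	 * early return together with the outer && df case.
	 */
	if (flush || (outer && df))
		return flush;

	/* When we receive our second frame we can make a decision on if we
	 * continue this flow as an atomic flow with a fixed ID or if we use
	 * an incrementing ID.
	 */
	if (count == 1 && df && !ipid_offset)
		NAPI_GRO_CB(p)->is_atomic = true;

	/* Both values are zero iff the respective check passes: df is 0 or
	 * IP_DF, so reduce it to 0/1 before xoring with the is_atomic bit,
	 * and combine with && since the values do not share bits.
	 */
	ip_fixedid_matches = NAPI_GRO_CB(p)->is_atomic ^ !!df;
	ipid_offset_matches = ipid_offset - count;

	return ip_fixedid_matches && ipid_offset_matches;
}

One behavioural difference to double check: written this way, a packet
without DF in a non-fixed-ID flow no longer has its ID offset compared
against count, while the posted code does compare it.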
And while nitpicking: ipid_offset may be a more descriptive variable
name than flush_id, and ip_fixedid than is_atomic. If changing those
does not result in a lot of code churn.

> +}
> +
> +static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
> +{
> +	/* <Version:4><Traffic_Class:8><Flow_Label:20> */
> +	__be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
> +
> +	/* Flush if Traffic Class fields are different. */
> +	return !!((first_word & htonl(0x0FF00000)) |
> +		  (__force __be32)(iph->hop_limit ^ iph2->hop_limit));
> +}
> +
> +static inline int gro_network_flush(const void *th, const void *th2, struct sk_buff *p, int off)
> +{
> +	const bool encap_mark = NAPI_GRO_CB(p)->encap_mark;

Is this correct when udp_gro_complete clears this for tunnels?

> +	int flush = 0;
> +	int i;
> +
> +	for (i = 0; i <= encap_mark; i++) {
> +		const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
> +		const void *nh = th - diff;
> +		const void *nh2 = th2 - diff;
> +
> +		if (((struct iphdr *)nh)->version == 6)
> +			flush |= ipv6_gro_flush(nh, nh2);
> +		else
> +			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
> +	}

Maybe slightly better for branch prediction, and more obvious, if
creating a helper function __gro_network_flush and calling

	__gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[0])
	if (NAPI_GRO_CB(p)->encap_mark)
		__gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[1])

A fuller sketch of what I mean is at the end of this mail.

> +
> +	return flush;
> +}
> +
>  int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
>
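Spelling out that __gro_network_flush idea, a rough and untested sketch.
The extra bool outer argument is my addition: inet_gro_flush still needs
it, and it cannot be derived from the offset alone:

static inline int __gro_network_flush(const void *th, const void *th2,
				      struct sk_buff *p, const u16 diff,
				      bool outer)
{
	const void *nh = th - diff;
	const void *nh2 = th2 - diff;

	if (((struct iphdr *)nh)->version == 6)
		return ipv6_gro_flush(nh, nh2);

	return inet_gro_flush(nh, nh2, p, outer);
}

static inline int gro_network_flush(const void *th, const void *th2,
				    struct sk_buff *p, int off)
{
	/* Outer (or only) header: it counts as outer, and thus skips the
	 * ID check for DF packets, only when it encapsulates another one.
	 */
	int flush = __gro_network_flush(th, th2, p,
					off - NAPI_GRO_CB(p)->network_offsets[0],
					NAPI_GRO_CB(p)->encap_mark);

	if (NAPI_GRO_CB(p)->encap_mark)
		flush |= __gro_network_flush(th, th2, p,
					     off - NAPI_GRO_CB(p)->network_offsets[1],
					     false);

	return flush;
}

This keeps the common no-encapsulation case as a single straight-line
call instead of a loop whose bound depends on encap_mark.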