From: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Date: Tue, 13 Aug 2024 17:57:44 +0200

>
>
> On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
>> Alexander Lobakin <aleksander.lobakin@xxxxxxxxx> writes:
>>
>>> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>>
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,

[...]

>>> I did tests on both threaded NAPI for cpumap and my old implementation
>>> with a traffic generator and I have the following (in Kpps):
>>>
>
> What kind of traffic is the traffic generator sending?
>
> E.g. is this a type of traffic that gets GRO aggregated?

Yes. It's UDP, with UDP GRO enabled on the receiver.

>
>>>           direct Rx  direct GRO  cpumap  cpumap GRO
>>> baseline  2900       5800        2700    2700 (N/A)
>>> threaded                         2300    4000
>>> old GRO                          2300    4000
>>>
>
> Nice results. Just to confirm, the units are in Kpps.

Yes. I.e. cpumap was giving 2.7 Mpps without GRO and 4.0 Mpps with it.

>
>
>>> IOW,
>>>
>>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>>    GRO implementation and my old implementation, but Lorenzo's is also
>>>    a very nice cleanup, as it switches cpumap to threaded NAPI
>>>    completely and the final diffstat even removes more lines than it
>>>    adds, while mine adds a bunch of lines and refactors a couple
>>>    hundred, so I'd go with his variant.
>>>
>>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>>    Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>>    (4 Mpps vs 2.7 Mpps), even though the CPU needs to compute checksums
>>>    manually.
>>
>> One question for this: IIUC, the benefit of GRO varies with the traffic
>> mix, depending on how much the GRO logic can actually aggregate. So did
>> you test the pathological case as well (spraying packets over so many
>> flows that there is basically no aggregation taking place)? Just to
>> make sure we don't accidentally screw up performance in that case while
>> optimising for the aggregating case :)
>>
>
> For the GRO use-case, I think a basic TCP stream throughput test (like
> netperf) should show a benefit once cpumap enables GRO. Can you confirm
> this?

Yes, TCP benefits as well.

> Or does the missing hardware RX-hash and RX-checksum cause TCP GRO not
> to fully work, yet?

GRO works well for both TCP and UDP. The main bottleneck is that GRO now
has to calculate the checksum on the CPU, since there's no checksum
status from the NIC. Also, the missing Rx hash means GRO places packets
from every flow into the same bucket, but that's not a big deal (they
get compared layer by layer anyway).

>
> Thanks A LOT for doing this benchmarking!

I optimized the code a bit and picked up my old patches for bulk NAPI
skb cache allocation, and today I got 4.7 Mpps 🎉

IOW, the result of the series (7 patches in total, but 2 of them are not
networking-related) is 2.7 -> 4.7 Mpps == +75%!

Daniel, if you want, you can pick my tree[0], either in full or just up
to "bpf: cpumap: switch to napi_skb_cache_get_bulk()" (13 patches total:
6 for netdev_features_t and 7 for the cpumap), and test it with your use
cases. It would be nice to see some real-world results, not just my
synthetic tests :D

> --Jesper

[0] https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/

Thanks,
Olek
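
P.S. For readers not following the series: below is a rough sketch of
the kind of poll loop being discussed, i.e. cpumap draining its
xdp_frame ring from NAPI context and feeding the resulting skbs into
GRO. This is NOT the actual patch: the function name cpu_map_napi_poll
and the ->napi member in struct bpf_cpu_map_entry are made up for
illustration, while napi_gro_receive(), napi_complete_done(),
ptr_ring_consume() and xdp_build_skb_from_frame() are existing kernel
APIs.

/* Rough sketch only, not the actual patch. It assumes cpumap keeps a
 * napi_struct in struct bpf_cpu_map_entry (kernel/bpf/cpumap.c) next
 * to its existing ptr_ring of xdp_frames; the poll function and the
 * ->napi member are hypothetical.
 */
#include <linux/netdevice.h>
#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <net/xdp.h>

static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
{
	struct bpf_cpu_map_entry *rcpu;
	int done = 0;

	rcpu = container_of(napi, struct bpf_cpu_map_entry, napi);

	while (done < budget) {
		struct xdp_frame *xdpf;
		struct sk_buff *skb;

		xdpf = ptr_ring_consume(rcpu->queue);
		if (!xdpf)
			break;

		/* Builds an skb carrying no HW offload info: ip_summed
		 * stays CHECKSUM_NONE and there is no Rx hash, so GRO
		 * has to checksum on the CPU and hashes all flows into
		 * one bucket -- the limitation discussed above.
		 */
		skb = xdp_build_skb_from_frame(xdpf, xdpf->dev_rx);
		if (!skb) {
			xdp_return_frame(xdpf);
			continue;
		}

		napi_gro_receive(napi, skb);
		done++;
	}

	/* Completing the NAPI also flushes packets held by GRO */
	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}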