From: Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx>
Date: Tue, 13 Aug 2024 18:27:41 +0200

> On Aug 13, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>
>>> From: Lorenzo Bianconi <lorenzo.bianconi@xxxxxxxxxx>
>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>
>>>>> Hi Alexander,
>>>>>
>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>> cpumap has its own BH context based on a kthread. It has a sane
>>>>>> batch size of 8 frames per cycle.
>>>>>> GRO can be used on its own; adjust the cpumap calls to the upper
>>>>>> stack to use the GRO API instead of netif_receive_skb_list(),
>>>>>> which processes skbs in batches but doesn't involve the GRO
>>>>>> layer at all.
>>>>>> It is most beneficial when the NIC the frames come from is XDP
>>>>>> generic metadata-enabled, but in plenty of tests GRO performs
>>>>>> better than listified receiving even though it has to calculate
>>>>>> full frame checksums on the CPU.
>>>>>> As GRO passes the skbs to the upper stack in batches of
>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to
>>>>>> the device where the frame comes from, it is enough to disable
>>>>>> the GRO netdev feature on it to completely restore the original
>>>>>> behaviour: untouched frames will be bulked and passed to the
>>>>>> upper stack by 8, as it was with netif_receive_skb_list().
>>>>>>
>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@xxxxxxxxx>
>>>>>> ---
>>>>>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>>  1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>>
>>>>>
>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>> cpumap is still missing this.
>>>
>>> The only concern with having GRO in cpumap without metadata from
>>> the NIC descriptor was that when the checksum status is missing,
>>> GRO calculates the checksum on the CPU, which is not really fast.
>>> But I remember GRO was sometimes faster despite that.
>>>
>>>>>
>>>>> I have a production use case for this now. We want to do some
>>>>> intelligent RX steering and I think GRO would help over listified
>>>>> receive in some cases. We would prefer to steer in HW (and thus
>>>>> get the existing GRO support), but not all our NICs support it,
>>>>> so we need a software fallback.
>>>>>
>>>>> Are you still interested in merging the cpumap + GRO patches?
>>>
>>> For sure, I can revive this part. I was planning to get back to
>>> this branch, pick the patches not related to XDP hints and send
>>> them separately.
>>>
>>>>
>>>> Hi Daniel and Alex,
>>>>
>>>> Recently I worked on a PoC to add GRO support to the cpumap
>>>> codebase:
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>>>   Here I added GRO support to cpumap through gro_cells.
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>>>   Here I added GRO support to cpumap through the threaded NAPI
>>>>   APIs (with some changes to them).
>>>
>>> Hmm, when I was testing it, adding a whole NAPI to cpumap seemed
>>> like overkill; that's why I separated the GRO structure from
>>> &napi_struct.
>>>
>>> Let me find some free time; I would then test all 3 solutions
>>> (mine, gro_cells, threaded NAPI) and pick/send the best one.
>>>
>>>>
>>>> Please note I have not run any performance tests so far, just
>>>> verified it does not crash (I was planning to resume this work
>>>> soon). Please let me know if it works for you.
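
For anyone following along, the threaded-NAPI variant boils down to
giving each cpumap entry its own NAPI instance running in a kthread and
letting its poll callback feed the queued frames into GRO. A rough
sketch below, in the context of kernel/bpf/cpumap.c, with a
hypothetical ->napi member in &bpf_cpu_map_entry; this is not a literal
excerpt from Lorenzo's branch:

	/* Sketch only: assumes the ring entries were already converted
	 * to skbs; in the real cpumap code the ring holds xdp_frames
	 * which are turned into skbs first.
	 */
	static int cpu_map_napi_poll(struct napi_struct *napi, int budget)
	{
		struct bpf_cpu_map_entry *rcpu =
			container_of(napi, struct bpf_cpu_map_entry, napi);
		int done = 0;

		while (done < budget) {
			struct sk_buff *skb = __ptr_ring_consume(rcpu->queue);

			if (!skb)
				break;

			napi_gro_receive(napi, skb);
			done++;
		}

		/* Ran out of frames before the budget was consumed:
		 * complete the NAPI. napi_complete_done() also flushes
		 * whatever GRO is still holding.
		 */
		if (done < budget)
			napi_complete_done(napi, done);

		return done;
	}

The NAPI would then be registered with netif_napi_add() against some
backing netdev and switched to kthread mode via dev_set_threaded();
I guess that's where Lorenzo's "some changes" to the threaded-NAPI
APIs come in, since cpumap has no natural netdev to hang it off.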
>>
>> I did tests on both the threaded NAPI for cpumap and my old
>> implementation with a traffic generator, and I got the following
>> (in Kpps):
>>
>>            direct Rx    direct GRO    cpumap    cpumap GRO
>> baseline   2900         5800          2700      2700 (N/A)
>> threaded                              2300      4000
>> old GRO                               2300      4000
>
> out of curiosity, have you tested the gro_cells one as well?

I haven't. I mean, I could, but I don't expect that cpumap's kthread +
a separate NAPI would give better results than a merged NAPI + kthread.

>
> Lorenzo

Thanks,
Olek
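
P.S. For completeness, the gro_cells approach would look roughly like
the below (a sketch against the standard <net/gro_cells.h> API, not an
excerpt from Lorenzo's commit; the ->gcells placement in
&bpf_cpu_map_entry is hypothetical). Each cell is a per-CPU skb queue
plus a NAPI, so cpumap would only hand skbs over and the cell's NAPI
would then run GRO in softirq context:

	#include <net/gro_cells.h>

	/* gro_cells_init() allocates one NAPI-backed cell per possible
	 * CPU. It needs a backing net_device, which is the awkward part
	 * here: cpumap entries don't naturally own one.
	 */
	static int cpu_map_gro_cells_init(struct bpf_cpu_map_entry *rcpu,
					  struct net_device *dev)
	{
		return gro_cells_init(&rcpu->gcells, dev);
	}

	static void cpu_map_gro_cells_rx(struct bpf_cpu_map_entry *rcpu,
					 struct sk_buff *skb)
	{
		/* Queues the skb on the current CPU's cell and schedules
		 * its NAPI. gro_cells_destroy() would go into the entry
		 * teardown path.
		 */
		gro_cells_receive(&rcpu->gcells, skb);
	}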