Re: XDP Performance Regression in recent kernel versions

Samuel Dobron <sdobron@xxxxxxxxxx> · Mon, 29 Jul 2024 20:00:42 +0200

Ah, sorry.
Yes, I was talking about 6.4 regression.

I double-checked the v5.15 regression and I don't see anything
as significant as what Sebastiano reported. I ran a couple of tests for:
* kernel-5.10.0-0.rc6.90.eln105
* kernel-5.14.0-60.eln112
* kernel-5.15.0-0.rc7.53.eln113
* kernel-5.16.0-60.eln114
* kernel-6.11.0-0.rc0.20240724git786c8248dbd3.12.eln141
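
For anyone mapping these builds back to upstream commits: the snapshot
build above appears to embed a date and an abbreviated commit hash
(20240724git786c8248dbd3). A minimal sketch for pulling those out, assuming
the "<YYYYMMDD>git<hash>" layout holds for other snapshot builds as well;
decode_snapshot is a hypothetical helper, not part of any existing tool:

```python
import re
from datetime import date, datetime

# Assumption: snapshot builds carry ".<YYYYMMDD>git<abbreviated hash>." in
# the release field; builds without it (e.g. 5.14.0-60.eln112) do not match.
SNAPSHOT_RE = re.compile(r"\.(?P<date>\d{8})git(?P<commit>[0-9a-f]{7,40})\.")


def decode_snapshot(nvr):
    """Return (snapshot_date, commit) for a snapshot build, else None."""
    match = SNAPSHOT_RE.search(nvr)
    if match is None:
        return None
    snapshot_date = datetime.strptime(match["date"], "%Y%m%d").date()
    return snapshot_date, match["commit"]


if __name__ == "__main__":
    print(decode_snapshot("kernel-6.11.0-0.rc0.20240724git786c8248dbd3.12.eln141"))
    print(decode_snapshot("kernel-5.14.0-60.eln112"))
```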

The XDP_DROP results on the receiving side (the one that is dropping
packets) are more or less the same, ~20.5Mpps (17.5Mpps on 6.11, but
that's due to the 6.4 regression). The CPU is the bottleneck, so CPU
utilization is 100% for all the kernels on both ends, generator and
receiver. We use pktgen as the generator, and both generator and receiver
machines use mlx5 NICs.

However, I noticed that between 5.10 and 5.14 there is a 30Mpps->22Mpps
regression, BUT on the GENERATOR side. CPU utilization remains the same
on both ends, and the number of dropped packets on the receiver side is
the same as well (since it's CPU bottlenecked). Other drivers seem
to be unaffected.
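
To put the numbers in relative terms, here is a minimal sketch of the
arithmetic. relative_drop is a hypothetical helper, and the inputs are
just the rounded Mpps figures quoted above, so the percentages are
approximate:

```python
def relative_drop(before_mpps, after_mpps):
    """Throughput loss as a percentage of the 'before' rate."""
    return (before_mpps - after_mpps) / before_mpps * 100.0


if __name__ == "__main__":
    # Generator side, 5.10 -> 5.14 (pktgen on mlx5).
    print(f"generator: {relative_drop(30.0, 22.0):.1f}%")
    # Receiver side, XDP_DROP, older kernels -> 6.11.
    print(f"receiver:  {relative_drop(20.5, 17.5):.1f}%")
```

That works out to roughly a 27% drop on the generator side versus
roughly 15% on the receiver side.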

That's probably unrelated to Sebastiano's regression,
but I believe it's worth mentioning.

So I have no idea where Sebastiano's regression comes from. I can see
he uses ConnectX-6; we don't have those, only ConnectX-5. Could that
be the problem?

Thanks,
Sam.

On Fri, Jul 26, 2024 at 10:09 AM Dragos Tatulea <dtatulea@xxxxxxxxxx> wrote:
>
> Hi,
>
> On Wed, 2024-07-24 at 17:36 +0200, Toke Høiland-Jørgensen wrote:
> > Carolina Jubran <cjubran@xxxxxxxxxx> writes:
> >
> > > On 22/07/2024 12:26, Dragos Tatulea wrote:
> > > > On Sun, 2024-06-30 at 14:43 +0300, Tariq Toukan wrote:
> > > > >
> > > > > On 21/06/2024 15:35, Samuel Dobron wrote:
> > > > > > Hey all,
> > > > > >
> > > > > > Yeah, we do tests for ELN kernels [1] on a regular basis. Since
> > > > > > ~January of this year.
> > > > > >
> > > > > > As already mentioned, mlx5 is the only driver affected by this regression.
> > > > > > Unfortunately, I think Jesper is actually hitting 2 regressions we noticed,
> > > > > > the one already mentioned by Toke, another one [0] has been reported
> > > > > > in early February.
> > > > > > Btw. issue mentioned by Toke has been moved to Jira, see [5].
> > > > > >
> > > > > > > Not sure all of you are able to see the content of [0], Jira says it's
> > > > > > > RH-confidential.
> > > > > > > So, I am not sure how much I can share without being fired :D. Anyway,
> > > > > > > affected kernels have been released a while ago, so anyone can find it
> > > > > > > on their own.
> > > > > > Basically, we detected 5% regression on XDP_DROP+mlx5 (currently, we
> > > > > > don't have data for any other XDP mode) in kernel-5.14 compared to
> > > > > > previous builds.
> > > > > >
> > > > > >   From tests history, I can see (most likely) the same improvement
> > > > > > on 6.10rc2 (from 15Mpps to 17-18Mpps), so I'd say 20% drop has been
> > > > > > (partially) fixed?
> > > > > >
> > > > > > > For earlier 6.10 kernels we don't have data due to [3] (there is a regression on
> > > > > > > XDP_DROP as well, but I believe it's a turbo-boost issue, as I mentioned
> > > > > > > in the issue).
> > > > > > > So if you want to run tests on 6.10, please see [3].
> > > > > >
> > > > > > Summary XDP_DROP+mlx5@25G:
> > > > > > > kernel    pps       
> > > > > > > <5.14     20.5M     baseline
> > > > > > > >=5.14    19M       [0]
> > > > > > > <6.4      19-20M    baseline for ELN kernels
> > > > > > > >=6.4     15M       [4 and 5] (mentioned by Toke)
> > > > >
> > > > > + @Dragos
> > > > >
> > > > > That's about when we added several changes to the RX datapath.
> > > > > Most relevant are:
> > > > > - Fully removing the in-driver RX page-cache.
> > > > > - Refactoring to support XDP multi-buffer.
> > > > >
> > > > > We tested XDP performance before submission, I don't recall we noticed
> > > > > such a degradation.
> > > >
> > > > Adding Carolina to post her analysis on this.
> > >
> > > Hey everyone,
> > >
> > > After investigating the issue, it seems the performance degradation is
> > > linked to the commit "x86/bugs: Report Intel retbleed vulnerability"
> > > (6ad0ad2bf8a67).
> >
> > Hmm, that commit is from June 2022, [...]
> >
> The results from the very first mail in this thread from Sebastiano were
> showing a 30Mpps -> 21.3Mpps XDP_DROP regression between 5.15 and 6.2. This
> is what Carolina was focused on. Furthermore, the results from Samuel don't show
> this regression. Seems like the discussion is now focused on the 6.4 regression?
>
> > [...] and according to Samuel's tests,
> > this issue was introduced sometime between commits b6dad5178cea and
> > 40f71e7cd3c6 (both of which are dated in June 2023).
> >
> Thanks for the commit range (now I know how to decode ELN kernel versions :)).
> Strangely, this range doesn't have anything suspicious. I would have expected
> the page_pool or the XDP multibuf changes to show up in this range.
> But they are already present in the working version... Anyway, we'll keep on
> looking.
>
> >  Besides, if it was
> > a retbleed mitigation issue, that would affect other drivers as well,
> > no? Our testing only shows this regression on mlx5, not on the intel
> > drivers.
> >
> >
> > > > > I'll check with Dragos as he probably has these reports.
> > > > >
> > > > We only noticed a 6% degradation for XDP_DROP.
> > > >
> > > > https://lore.kernel.org/netdev/b6fcfa8b-c2b3-8a92-fb6e-0760d5f6f5ff@xxxxxxxxxx/T/
> >
> > That message mentions that "This will be handled in a different patch
> > series by adding support for multi-packet per page." - did that ever go
> > in?
> >
> Nope, no XDP multi-packet per page yet.
>
> Thanks,
> Dragos