Re: This is the fourth time I've tried to find what led to the regression of outgoing network speed and each time I find the merge commit 8c94ccc7cd691472461448f98e2372c75849406c

Mathias Nyman <mathias.nyman@xxxxxxxxxxxxxxx> · Mon, 26 Feb 2024 12:54:03 +0200

On 26.2.2024 11.51, Linux regression tracking (Thorsten Leemhuis) wrote:
On 26.02.24 10:24, Mathias Nyman wrote:
On 26.2.2024 7.45, Linux regression tracking (Thorsten Leemhuis) wrote:
On 21.02.24 14:44, Mathias Nyman wrote:
On 21.2.2024 1.43, Randy Dunlap wrote:
On 2/20/24 15:41, Randy Dunlap wrote:
[+ tglx]
On 2/20/24 15:19, Mikhail Gavrilov wrote:
On Mon, Feb 19, 2024 at 2:41 PM Mikhail Gavrilov
<mikhail.v.gavrilov@xxxxxxxxx> wrote:
I spotted a network performance regression, and it turned out this was
due to the network card getting a different interrupt. It is a side effect
of commit 57e153dfd0e7a080373fe5853c5609443d97fa5a.
That's a merge commit (AFAIK, maybe not so much). The commit in
mainline is:

commit f977f4c9301c
Author: Niklas Neronin <niklas.neronin@xxxxxxxxxxxxxxx>
Date:   Fri Dec 1 17:06:40 2023 +0200

       xhci: add handler for only one interrupt line

Installing irqbalance daemon did not help. Maybe someone experienced
such a problem?

Thomas, would you look at this, please?

A network device and xhci (USB) driver are now sharing interrupts.
This causes a large performance decrease for the networking device.
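
One way to check for such sharing on a given system is to scan
/proc/interrupts for lines that list more than one handler. A minimal
Python sketch, assuming the usual layout (a header row of CPU names, then
per-CPU counters followed by the chip, the trigger type, and
comma-separated handler names). The sample text, the IRQ numbers, and the
names xhci_hcd and enp5s0 are made up for illustration:

```python
# Hypothetical excerpt in the style of /proc/interrupts (not real data).
SAMPLE_SHARED = """\
           CPU0       CPU1
  0:         45          0   IO-APIC   2-edge      timer
 16:       1000       2000   IO-APIC  16-fasteoi   xhci_hcd, enp5s0
NMI:          3          4   Non-maskable interrupts
"""


def shared_irqs(text):
    """Return {irq: [handler names]} for numeric IRQs with more than one handler."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: one column per CPU
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip summary rows such as NMI or LOC
        tail = " ".join(rest.split()[ncpu:])  # drop the per-CPU counters
        parts = tail.split(",")
        first = parts[0].split()
        if not first:
            continue  # no handler listed on this line
        # Assumption: handler names contain no spaces, so the last token
        # before the first comma is the first handler name.
        names = [first[-1]] + [p.strip() for p in parts[1:]]
        if len(names) > 1:
            shared[label] = names
    return shared


print(shared_irqs(SAMPLE_SHARED))
```

On a live system the same function could be fed the contents of
/proc/interrupts instead of the sample string.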

Short recap:

Thx for that. As the 6.8 release is merely two or three weeks away while
a fix is nowhere near in sight yet (afaics!) I start to wonder if we
should consider a revert here and try reapplying the culprit in a later
cycle when this problem is fixed.

Thx for the reply.

I don't think reverting this series is a solution.

This isn't really about those usb xhci patches.
This is about which interrupt gets assigned to which CPU.

I know, but from my understanding of Linus's expectations wrt handling
regressions it does not matter much if a bug existed earlier or
somewhere else: what counts is the commit that exposed the problem.

But I might be wrong here. Anyway, not CCing Linus for this; but I'll
likely point him in this direction on Sunday in my next weekly report,
unless some fix comes into sight.

Mikhail got unlucky when the network adapter interrupts on that system were
assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
bandwidth.
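
To see which CPU is actually servicing a given interrupt, the per-CPU
counters in /proc/interrupts can be compared. A rough Python sketch under
the same layout assumption (header row of CPU names, then one counter per
CPU on each IRQ line). The sample numbers, IRQ numbers, and device names
are invented for illustration and are not taken from Mikhail's system:

```python
# Hypothetical excerpt in the style of /proc/interrupts (not real data).
SAMPLE_AFFINITY = """\
           CPU0       CPU1       CPU2       CPU3
 42:     900000          0          0         12   PCI-MSI 1048576-edge      xhci_hcd
 57:     750000         10          0          0   PCI-MSI 2621440-edge      enp5s0
"""


def busiest_cpu(text, irq):
    """Return (cpu_name, count) for the CPU with the highest counter on `irq`.

    Returns None if the IRQ is not listed or has no counters.
    """
    lines = text.splitlines()
    if not lines:
        return None
    cpus = lines[0].split()  # e.g. ["CPU0", "CPU1", ...]
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        counts = []
        for tok in rest.split()[:len(cpus)]:
            if not tok.isdigit():
                break  # reached the chip/handler columns
            counts.append(int(tok))
        if not counts:
            return None
        idx = max(range(len(counts)), key=counts.__getitem__)
        return cpus[idx], counts[idx]
    return None


print(busiest_cpu(SAMPLE_AFFINITY, 57))
```

If two busy devices both report their highest counter on the same CPU,
that is consistent with the "clogged" CPU picture described above, though
the counters alone do not prove it is the cause of a bandwidth drop.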

But maybe others will be just as "unlucky". Or is there anything to
believe otherwise? Maybe some aspect of the .config or local setup that
is most likely unique to Mikhail's setup?

I believe this is a zero-sum case.

Others got equally lucky due to this change.
Their devices end up interrupting less clogged CPUs and see a similar
performance increase.

Thanks
Mathias