On Mon, Feb 26 2024 at 12:54, Mathias Nyman wrote:
> On 26.2.2024 11.51, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> I don't think reverting this series is a solution.
>>>
>>> This isn't really about those usb xhci patches.
>>> This is about which interrupt gets assigned to which CPU.
>>
>> I know, but from my understanding of Linus expectations wrt to handling
>> regressions it does not matter much if a bug existed earlier or
>> somewhere else: what counts is the commit that exposed the problem.
>>
>> But I might be wrong here. Anyway, not CCing Linus for this; but I'll
>> likely point him to this direction on Sunday in my next weekly report,
>> unless some fix comes into sight.
>>
>>> Mikhail got unlucky when the network adapter interrupts on that system was
>>> assigned to CPU0, clearly a more "clogged" CPU, thus causing a drop in max
>>> bandwidth.
>>
>> But maybe others will be just as "unlucky". Or is there anything to
>> believe otherwise? Maybe some aspect of the .config or local setup that
>> is most likely unique to Mikhail's setup?
>
> I believe this is a zero-sum case.
>
> Others got equally lucky due to this change.
> Their devices end up interrupting less clogged CPUs and see a similar
> performance increase.

Reverting this does not make any sense.

The kernel assigns the initial interrupt affinities to the CPUs so that
the number of interrupts per CPU is roughly balanced. That spreading
algorithm is completely agnostic of the actual usage of the interrupts.

Where e.g. the network interrupt ends up depends on the
probe/enumeration order of devices. Add another PCI-E card into the
machine and it will again look different.

There is nothing the kernel can do about it, and earlier attempts to do
interrupt-frequency-based balancing in the kernel ended up nowhere,
simply because the kernel does not have enough information about the
overall requirements. That's why the kernel leaves the affinity
configuration to user space, e.g. irqbalance, except for true
multi-queue scenarios like NVMe, where the kernel binds queues and their
interrupts to specific CPUs or groups of CPUs.

Why ending up on CPU0 has this particular effect on Mikhail's machine is
unclear, as we don't have any information about the overall workload,
the other interrupt sources on CPU0, or their frequency. That would need
to be investigated with instrumentation and might unearth some
completely different underlying reason causing this behavior.

So I don't think this is a regression in the true sense of regressions.
It's an unfortunate coincidence, and reverting the identified commits
would just paper over the real problem, if there is actually one single
source of trouble which causes the performance drop only on CPU0.

The commits are definitely _not_ the root cause. They happen to unearth
some other issue, which might be as mundane as e.g. the NVMe interrupt
on CPU0 competing with the network interrupt.

So don't shoot the messenger.

Thanks,

        tglx
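
A cheap first step toward the investigation mentioned above is to check
which interrupt sources actually fire on CPU0. The following is a minimal
Python sketch, assuming the usual /proc/interrupts layout (a header row of
CPU names, then one row per interrupt with one counter per CPU). The
helper names parse_interrupts and top_sources are made up for
illustration and are not part of any existing tool.

```python
from pathlib import Path


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into (cpu_names, rows).

    Each row is a tuple (label, per_cpu_counts, description).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        if len(counts) != len(cpus):
            # Rows without one counter per CPU (e.g. ERR on SMP) are skipped.
            continue
        rows.append((label.strip(), counts, " ".join(fields[len(cpus):])))
    return cpus, rows


def top_sources(text, cpu="CPU0", limit=5):
    """Return the busiest interrupt sources on one CPU, highest count first."""
    cpus, rows = parse_interrupts(text)
    idx = cpus.index(cpu)
    ranked = sorted(rows, key=lambda row: row[1][idx], reverse=True)
    return [(label, counts[idx], desc) for label, counts, desc in ranked[:limit]]


if __name__ == "__main__":
    path = Path("/proc/interrupts")
    if path.exists():
        for label, count, desc in top_sources(path.read_text()):
            print(f"{label:>5} {count:>12} {desc}")
```

The counters are cumulative since boot, so two snapshots taken a few
seconds apart while the workload is running give a better idea of
interrupt frequency than a single reading. If the network and NVMe
interrupts both show up at the top of the CPU0 list, moving one of them
via /proc/irq/<N>/smp_affinity_list (or letting irqbalance do it) is an
easy way to test the theory.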