On Sun, Feb 26, 2023 at 6:20 AM <fk1xdcio@xxxxxxxx> wrote:
>
> On 2023-02-25 13:28, Chris wrote:
> > I'm testing a generic 4-port PCIe x4 2.5Gbps Ethernet NIC. It uses an
> > ASM1812 for the PCI packet switch to four RTL8125BG network
> > controllers.
>
> Sorry, I forget my attachment with the PCI device information.

Looks like your mail client is breaking threads. Anyway, the only thing
of interest I can see in the log is that AER is reporting correctable
errors on three of your four NICs:

07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr+ BadTLP- BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr+ BadTLP- BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
0a:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr+ BadTLP- BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-

Bad DLLP (Data Link Layer Packet) errors suggest that specific card has
signal integrity issues. Assuming that's true, more traffic to the NIC
means more opportunities for uncorrectable errors, which would explain
the hard lockups.

I'm not sure why setting pci=nommconf seems to fix the problem, but my
guess is that it's just masking the issue. The AER capability lives in
the extended config space, which can only be reached through the
memory-mapped config space, so disabling that probably just stops the
kernel from noticing that errors are occurring. The network stack is
pretty forgiving of errors since it can just drop packets, which might
also explain the lower throughput.
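
If you want to keep an eye on those status bits while you test, here is a rough Python sketch that pulls the CESta flags out of lspci-style text and lists the ones that are set for each device. The function name `correctable_errors` and the `SAMPLE` string are made up for illustration; the sample just repeats two of the devices from the dump above. On a real system you would presumably feed it the output of `lspci -vv` run as root, and that part is an untested assumption here.

```python
import re

# Device header, e.g. "08:00.0 Ethernet controller: ..." (PCI domain optional).
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# One status flag, e.g. "RxErr+" (set) or "BadTLP-" (clear).
FLAG_RE = re.compile(r"(\w+)([+-])")

# Illustrative input copied from the lspci dump quoted in this thread.
SAMPLE = """\
07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
        Capabilities: [100 v2] Advanced Error Reporting
                CESta:  RxErr+ BadTLP- BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
"""


def correctable_errors(lspci_text):
    """Map each PCI address to the list of CESta flags marked '+'."""
    result = {}
    current = None
    for raw in lspci_text.splitlines():
        line = raw.strip()
        header = DEVICE_RE.match(line)
        if header:
            # Remember which device the following capability lines belong to.
            current = header.group(1)
            continue
        if current and line.startswith("CESta:"):
            flags = FLAG_RE.findall(line[len("CESta:"):])
            result[current] = [name for name, state in flags if state == "+"]
    return result


if __name__ == "__main__":
    for addr, errors in correctable_errors(SAMPLE).items():
        print(addr, ", ".join(errors) if errors else "no correctable errors")
```

Note that this only reads text, so if the kernel is booted with pci=nommconf and the AER capability is not visible, there will be nothing for it to find.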