On Wed, Mar 08, 2023 at 02:49:42PM -0600, Bjorn Helgaas wrote:
> On Sat, Feb 25, 2023 at 01:37:23PM -0500, fk1xdcio@xxxxxxxx wrote:
> > I'm testing a generic 4-port PCIe x4 2.5Gbps Ethernet NIC. It uses an
> > ASM1812 for the PCI packet switch to four RTL8125BG network controllers.
> >
> > The more load I put on the NIC the faster the system freezes. For example if
> > I activate four 2.5Gbps fully saturated network connections then the system
> > hard freezes almost immediately. When the system freezes it seems completely
> > dead. SysRq doesn't work, serial consoles are dead, etc. so I haven't been
> > able to get much debugging information. I have tested on various different
> > physical systems, Xeon E5, Xeon E3, i7, and they all behave the same so it
> > doesn't seem like a system hardware issue.
> >
> > Disabling IOMMU makes it run for a little longer before crashing.
> >
> > The tiny bit of error information I have been able to get under various
> > conditions (eg. disabling ASPM, forcing D0, etc):
> > Test #1:
> > pcieport 0000:04:02.0: Unable to change power state from D3hot to D0,
> > device inaccessible
> >
> > Test #2:
> > pcieport 0000:04:02.0: can't change power state from D3cold to D0 (config
> > space inaccessible)
> > pcieport 0000:03:00.0: Wakeup disabled by ACPI
> > pcieport 0000:04:02.0: PME# disabled
> >
> > Test #3:
> > enp7s0: cmd = 0xff, should be 0x07 \x0a.
> > enp7s0: pci link is down \x0a.
> >
> > At times there are several of those errors printed for the different PCI
> > devices of the NIC before the system locks up.
> >
> > Setting "pci=nommconf" on the kernel command line is the only thing that
> > seems to fix the issue but performance is degraded when using bidirectional
> > transfers. 2.5Gbps TX but only 1.5Gbps RX compared to MMCONFIG enabled which
> > gets full 2.5Gbps bidirectional.
> >
> > So it seems the MMCONFIG works sometimes but eventually something happens
> > and it becomes inaccessible at which point the system freezes. Is there a
> > way to keep MMCONFIG enabled for other devices but not this ASM1812 device?
> > Or better, is there a way to debug and fix MMCONFIG for the device?
>
> Thanks for the report!
>
> So IIUC, "pci=nommconf" avoids the system hang completely, but network
> performance is lower. Do the NIC stats show packet drops that might
> explain the performance problem?
>
> You mentioned later that you see AER errors caused by ASPM, and they
> go away if you disable power management (but the hard lockups still
> happen). Is it "pcie_aspm=off" or "pcie_port_pm=off" or something
> else that makes this difference?

I don't want to forget about this issue. Have you learned anything new,
e.g., any answers to the questions above?

I don't have any good ideas yet, but if we keep pushing on it, we
might be able to figure out something.

Bjorn
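
For the packet-drop question, one low-effort check is to sample the
per-interface counters under /sys/class/net/<iface>/statistics/ before
and after a test run. Below is a minimal sketch, assuming Python 3 is
available on the test machine. The helper names read_counters and
diff_counters and the choice of counters are only illustrative, and
"enp7s0" is just the interface name from the log above.

```python
import os

# Counters most likely to explain lost RX throughput (illustrative selection).
COUNTERS = (
    "rx_dropped", "tx_dropped",
    "rx_errors", "tx_errors",
    "rx_missed_errors", "rx_fifo_errors",
)


def read_counters(iface, base="/sys/class/net"):
    """Return {counter: value} for one interface.

    Counters that are missing or unreadable are skipped, so an unknown
    interface yields an empty dict instead of raising.
    """
    stats_dir = os.path.join(base, iface, "statistics")
    result = {}
    for name in COUNTERS:
        try:
            with open(os.path.join(stats_dir, name)) as f:
                result[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return result


def diff_counters(before, after):
    """Return how much each counter grew between two samples."""
    return {k: after[k] - before[k] for k in after if k in before}
```

Taking before = read_counters("enp7s0") ahead of the bidirectional
transfer and diff_counters(before, read_counters("enp7s0")) afterwards
should show whether drops grow when booted with "pci=nommconf". The
driver may expose more detailed counters via "ethtool -S".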