I'm testing a generic 4-port PCIe x4 2.5Gbps Ethernet NIC. It uses an
ASM1812 PCIe packet switch in front of four RTL8125BG network controllers.
The more load I put on the NIC, the faster the system freezes. For
example, if I activate four fully saturated 2.5Gbps network connections,
the system hard freezes almost immediately. When the system freezes
it seems completely dead: SysRq doesn't work, serial consoles are dead,
etc., so I haven't been able to get much debugging information. I have
tested on several different physical systems (Xeon E5, Xeon E3, i7) and
they all behave the same, so it doesn't seem to be a system hardware
issue.
Disabling the IOMMU makes it run a little longer before crashing.
The tiny bit of error information I have been able to get under various
conditions (e.g. disabling ASPM, forcing D0, etc.):
Test #1:
  pcieport 0000:04:02.0: Unable to change power state from D3hot to D0, device inaccessible
Test #2:
  pcieport 0000:04:02.0: can't change power state from D3cold to D0 (config space inaccessible)
  pcieport 0000:03:00.0: Wakeup disabled by ACPI
  pcieport 0000:04:02.0: PME# disabled
Test #3:
  enp7s0: cmd = 0xff, should be 0x07 \x0a.
  enp7s0: pci link is down \x0a.
At times there are several of those errors printed for the different PCI
devices of the NIC before the system locks up.
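
When several such messages show up before a lockup, it can help to tally which PCI addresses are reporting errors. The following is only a minimal sketch, not an existing tool: the function name `count_devices` and the `sample` list are made up for illustration, and the sample lines are the Test #2 messages quoted above.

```python
import re
from collections import Counter

# Matches a PCI address such as 0000:04:02.0 (domain:bus:device.function).
BDF_RE = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def count_devices(log_lines):
    """Count how many log lines mention each PCI address."""
    counts = Counter()
    for line in log_lines:
        for match in BDF_RE.finditer(line):
            counts[match.group(0)] += 1
    return counts


# Hypothetical input: the Test #2 messages, one per line.
sample = [
    "pcieport 0000:04:02.0: can't change power state from D3cold to D0 "
    "(config space inaccessible)",
    "pcieport 0000:03:00.0: Wakeup disabled by ACPI",
    "pcieport 0000:04:02.0: PME# disabled",
]

for address, hits in count_devices(sample).most_common():
    print(address, hits)
```

Feeding it the saved output of a serial console or netconsole capture would show whether the failures cluster on one downstream port of the switch or are spread across all of them.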
Setting "pci=nommconf" on the kernel command line is the only thing that
seems to fix the issue, but performance is degraded for bidirectional
transfers: 2.5Gbps TX but only 1.5Gbps RX, compared to the full 2.5Gbps
in both directions with MMCONFIG enabled.
So it seems MMCONFIG access works for a while, but eventually something
happens and it becomes inaccessible, at which point the system freezes.
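
For reference, MMCONFIG (PCIe ECAM) maps each function's 4 KiB configuration space into memory at a fixed offset derived from its bus, device and function numbers. The sketch below shows that arithmetic for the failing port 04:02.0. The helper name `ecam_offset` is made up for illustration, and `HYPOTHETICAL_BASE` is an assumed value: the real base address comes from the platform's ACPI MCFG table and differs between machines.

```python
def ecam_offset(bus, dev, fn, reg=0):
    """Return the ECAM byte offset of a config register for bus:dev.fn."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8 and 0 <= reg < 4096):
        raise ValueError("bus/dev/fn/reg out of range")
    # 20 bits per bus, 15 bits per device, 12 bits (4 KiB) per function.
    return (bus << 20) | (dev << 15) | (fn << 12) | reg


# Assumed base address, for illustration only; read the real one from ACPI MCFG.
HYPOTHETICAL_BASE = 0xE0000000

# Command register (config offset 0x04) of the port at 04:02.0.
offset = ecam_offset(0x04, 0x02, 0x0, 0x04)
print(hex(offset))
print(hex(HYPOTHETICAL_BASE + offset))
```

With "pci=nommconf" the kernel does not use this memory-mapped window and falls back to the legacy configuration access mechanism instead.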
Is there a way to keep MMCONFIG enabled for other devices but not this
ASM1812 device? Or better, is there a way to debug and fix MMCONFIG for
the device?