Re: ASMedia ASM1812 PCIe switch causes system to freeze hard

On Wed, Mar 08, 2023 at 02:49:42PM -0600, Bjorn Helgaas wrote:
> On Sat, Feb 25, 2023 at 01:37:23PM -0500, fk1xdcio@xxxxxxxx wrote:
> > I'm testing a generic 4-port PCIe x4 2.5Gbps Ethernet NIC. It uses an
> > ASM1812 for the PCI packet switch to four RTL8125BG network controllers.
> > 
> > The more load I put on the NIC the faster the system freezes. For example if
> > I activate four 2.5Gbps fully saturated network connections then the system
> > hard freezes almost immediately. When the system freezes it seems completely
> > dead. SysRq doesn't work, serial consoles are dead, etc. so I haven't been
> > able to get much debugging information. I have tested on various different
> > physical systems, Xeon E5, Xeon E3, i7, and they all behave the same so it
> > doesn't seem like a system hardware issue.
> > 
> > Disabling IOMMU makes it run for a little longer before crashing.
> > 
> > The tiny bit of error information I have been able to get under various
> > conditions (e.g., disabling ASPM, forcing D0, etc.):
> >   Test #1:
> >   pcieport 0000:04:02.0: Unable to change power state from D3hot to D0, device inaccessible
> > 
> >   Test #2:
> >   pcieport 0000:04:02.0: can't change power state from D3cold to D0 (config space inaccessible)
> >   pcieport 0000:03:00.0: Wakeup disabled by ACPI
> >   pcieport 0000:04:02.0: PME# disabled
> > 
> >   Test #3:
> >   enp7s0: cmd = 0xff, should be 0x07 \x0a.
> >   enp7s0: pci link is down \x0a.
> > 
> > At times there are several of those errors printed for the different PCI
> > devices of the NIC before the system locks up.
> > 
> > Setting "pci=nommconf" on the kernel command line is the only thing that
> > seems to fix the issue but performance is degraded when using bidirectional
> > transfers. 2.5Gbps TX but only 1.5Gbps RX compared to MMCONFIG enabled which
> > gets full 2.5Gbps bidirectional.
> > 
> > So it seems the MMCONFIG works sometimes but eventually something happens
> > and it becomes inaccessible at which point the system freezes. Is there a
> > way to keep MMCONFIG enabled for other devices but not this ASM1812 device?
> > Or better, is there a way to debug and fix MMCONFIG for the device?
> 
> Thanks for the report!
> 
> So IIUC, "pci=nommconf" avoids the system hang completely, but network
> performance is lower.  Do the NIC stats show packet drops that might
> explain the performance problem?
> 
> You mentioned later that you see AER errors caused by ASPM, and they
> go away if you disable power management (but the hard lockups still
> happen).  Is it "pcie_aspm=off" or "pcie_port_pm=off" or something
> else that makes this difference?

I don't want to forget about this issue.  Have you learned anything
new, e.g., any answers to the questions above?  I don't have any good
ideas yet, but if we keep pushing on it, we might be able to figure
out something.

Bjorn


