Re: CONFIG_PCIEASPM breaks PCIe on Marvell Armada 385 machine

On 01/18/2017 06:22 AM, Bjorn Helgaas wrote:
On Tue, Jan 17, 2017 at 03:37:10PM -0800, David Daney wrote:
[...]


Link (re)training can fail for several reasons including, but not
limited to:

- Poor signal propagation through the
chips/packages/boards/connectors, also known as Signal Integrity
(SI) problems.

- Incorrect implementation, in hardware, of link training protocols
at either end of the link

Usually, system and PCIe device vendors do a lot of testing and
signal analysis across a variety of configurations with the end goal
being that PCIe looks like a bullet-proof interconnect to the end
consumer.

Unfortunately, sometimes it doesn't work.  In these cases, the
vendors of the devices on each end of the link tend to point fingers
at the link partner for being defective in some way.

This patch:


The only one that comes to mind is this patch from David (CC'd) that
avoids ASPM-related retrains when we know the link doesn't support ASPM:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53f9a28bee3


Is an attempt to work around the problem from the system (host) end.
If the system vendor knows a priori that a defective PCIe device is
present in the system, the PCIe root port can be configured to
indicate no ASPM is supported, resulting (with the patch) in no link
retraining being attempted.
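
As a rough illustration of that idea (not the code from e53f9a28bee3 itself), the check could look something like the sketch below. It assumes the ASPM Support field sits in bits 11:10 of the PCIe Link Capabilities register; the helper name and mask name are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASPM Support field of the PCIe Link Capabilities register (bits 11:10). */
#define LNKCAP_ASPM_SUPPORT_MASK 0x00000c00u

/*
 * Hypothetical helper: an ASPM-related retrain is only worth attempting
 * if both ends of the link advertise at least one common ASPM state.
 * A root port configured to report "no ASPM" makes this return false.
 */
static bool aspm_retrain_worthwhile(uint32_t root_lnkcap, uint32_t dev_lnkcap)
{
	uint32_t root = root_lnkcap & LNKCAP_ASPM_SUPPORT_MASK;
	uint32_t dev = dev_lnkcap & LNKCAP_ASPM_SUPPORT_MASK;

	return (root & dev) != 0;
}
```

With the root port's field forced to zero, the helper returns false no matter what the endpoint advertises, so the retrain is skipped.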

To me it feels that we need a blacklist of devices that fail link
retraining at a high rate; when one is encountered, ASPM would be
disabled on the link where it resides.
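
A minimal sketch of what such a list might look like, written as standalone C rather than kernel code. The struct, table and function names are hypothetical, and the IDs are placeholders (0xffff is not a valid vendor ID), since no real device IDs are known in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pci_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder entries only: no real IDs are known for the failing device. */
static const struct pci_id aspm_retrain_blacklist[] = {
	{ 0xffff, 0x0001 },	/* hypothetical endpoint A */
	{ 0xffff, 0x0002 },	/* hypothetical endpoint B */
};

/* Return true if ASPM should be left disabled on the link to this device. */
static bool aspm_blacklisted(const struct pci_id *dev)
{
	size_t n = sizeof(aspm_retrain_blacklist) / sizeof(aspm_retrain_blacklist[0]);
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < n; i++) {
		if (aspm_retrain_blacklist[i].vendor == dev->vendor &&
		    aspm_retrain_blacklist[i].device == dev->device)
			return true;
	}
	return false;
}
```

Matching on vendor and device ID alone is the simplest choice; a real list might also need to match the root port, since the same endpoint may train fine behind a different one.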

I should have asked you for details about the defective devices
related to e53f9a28bee3 :)  If we had included that in the changelog,
we would have something to seed a blacklist with.

I no longer have access to the device I saw failing, so I don't know its PCI IDs. It was a solid-state storage device with a Xilinx FPGA acting as the PCIe endpoint. In any event, it would only fail in about 0.5% of system boots; it could not be made to fail reliably.
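
To give a feel for what a 0.5% failure rate means for testing, here is a small sketch that counts how many boots are needed before at least one failure is likely to show up. It assumes each boot fails independently with the same probability, which is a simplification; the function name is invented for this example.

```c
#include <assert.h>

/*
 * Smallest number of boots n such that the chance of seeing at least one
 * failure, 1 - (1 - p_fail)^n, reaches the requested confidence.
 * Assumes independent boots with a constant failure probability.
 */
static unsigned int boots_to_observe_failure(double p_fail, double confidence)
{
	double p_all_pass = 1.0;	/* probability that every boot so far passed */
	unsigned int boots = 0;

	assert(p_fail > 0.0 && p_fail < 1.0);
	assert(confidence > 0.0 && confidence < 1.0);

	while (1.0 - p_all_pass < confidence) {
		p_all_pass *= 1.0 - p_fail;
		boots++;
	}
	return boots;
}
```

Under those assumptions, boots_to_observe_failure(0.005, 0.95) comes out at 598, i.e. roughly 600 boots to be 95% sure of seeing the failure even once.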

The tricky thing here is assigning the blame for failure in link training. In the case in question we spent many months analysing the analog properties of the bus and examining/decoding analog scope captures of the failures before credibly assigning blame to the other guy. Usually what happens is the device vendor accurately claims that their device works flawlessly in conjunction with certain Intel root ports, so the problem must be fixed in the root port of the failing system. If you have a blacklist, you may be disabling ASPM in systems where it can work without failures.




There are several situations other than ASPM where link retraining is
required per spec (rate change, error handling, etc), and I guess we'd
have to avoid all of them.   So I suppose e53f9a28bee3 avoids the most
obvious failures, but maybe we could still see issues in those other
cases.

Bjorn
