On 01/17/2017 02:22 PM, Bjorn Helgaas wrote:
[+cc David]
On Tue, Jan 17, 2017 at 09:02:58PM +0000, Russell King - ARM Linux wrote:
On Tue, Jan 17, 2017 at 07:34:14PM +0000, Russell King - ARM Linux wrote:
Uwe, can you try:
setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
0x50.w=0x60
and see whether it remains alive (you can check by reading the root
register 0x52.w - bit 12 should be set once bit 11 clears again).
For reference, this I got wrong...
0xf1041a04 bit 0 indicates link status (0 = link up, 1 = link down).
If that's successful, maybe setting the common clock bit on the PCIe
device is what's causing the problem, in which case:
setpci -s 02:00.0 0x80.w=0x40
setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
0x50.w=0x60
Having worked with Uwe over IRC, it seems that any request to retrain
causes the link to go down, either with or without the common clock bit
set:
# setpci -s 2.0 0x50.w=0x60
# setpci -s 2.0 0x52.w
0011
# memtool md 0xf1041a04+4
f1041a04: 00010201
... reboot ...
# setpci -s 2.0 0x50.w=0x20
# memtool md 0xf1041a04+4
f1041a04: 00010201
which doesn't point towards ASPM itself; rather, the problem is caused by
a side effect of ASPM's setup code, which always triggers a retrain.
Bit 5 in that register is documented (at least in the Armada 370 docs
and Armada XP docs I have) as:
  5  RetrnLnk  RW   Retrain Link
               0x0  This bit forces the device to initiate link retraining.
                    Always returns 0 when read.
                    NOTE: If configured as an Endpoint, this field is
                    reserved and has no effect.
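
For reference, the values written and read in the commands above can be
decoded with the standard PCIe Link Control / Link Status bit layout. This
is only a minimal sketch: the function names are made up for illustration,
and the meaning of bit 0 of 0xf1041a04 is taken from the description
earlier in this thread, not from any documentation checked here.

```python
# Link Control register bits (PCIe capability offset 0x10).
LNKCTL_RETRAIN = 1 << 5        # Retrain Link
LNKCTL_COMMON_CLOCK = 1 << 6   # Common Clock Configuration

# Link Status register fields (PCIe capability offset 0x12).
LNKSTA_SPEED_MASK = 0x000F     # Current Link Speed
LNKSTA_WIDTH_MASK = 0x03F0     # Negotiated Link Width
LNKSTA_TRAINING = 1 << 11      # Link Training in progress
LNKSTA_SLOT_CLOCK = 1 << 12    # Slot Clock Configuration


def decode_lnkctl(value):
    """Report the two Link Control bits discussed in this thread."""
    return {
        "retrain": bool(value & LNKCTL_RETRAIN),
        "common_clock": bool(value & LNKCTL_COMMON_CLOCK),
    }


def decode_lnksta(value):
    """Split a Link Status value into the fields of interest."""
    return {
        "speed": value & LNKSTA_SPEED_MASK,
        "width": (value & LNKSTA_WIDTH_MASK) >> 4,
        "training": bool(value & LNKSTA_TRAINING),
        "slot_clock": bool(value & LNKSTA_SLOT_CLOCK),
    }


def link_is_down(status_reg):
    """Assumption from this thread: bit 0 is 0 for link up, 1 for link down."""
    return bool(status_reg & 1)
```

With these, 0x60 decodes as retrain plus common clock, 0x20 as retrain
only, and the 0011 read back from 0x52.w as speed code 1, width x1, with
bits 11 and 12 both clear. The value 00010201 has bit 0 set, i.e. link down.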
Bjorn, are you aware of similar situations where a request for the PCIe
link to be retrained causes it to fail?
Link (re)training can fail for several reasons including, but not
limited to:
- Poor signal propagation through the chips/packages/boards/connectors,
also known as Signal Integrity (SI) problems.
- Incorrect implementation, in hardware, of link training protocols at
either end of the link.
Usually, system and PCIe device vendors do a lot of testing and signal
analysis across a variety of configurations with the end goal being that
PCIe looks like a bullet-proof interconnect to the end consumer.
Unfortunately, sometimes it doesn't work. In these cases, the vendors of
the devices on each end of the link tend to point fingers at the link
partner for being defective in some way.
This patch:
The only one that comes to mind is this patch from David (CC'd) that
avoids ASPM-related retrains when we know the link doesn't support ASPM:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53f9a28bee3
is an attempt to work around the problem from the system (host) end. If
the system vendor knows a priori that a defective PCIe device is present
in the system, the PCIe root port can be configured to indicate no ASPM
is supported, resulting (with the patch) in no link retraining being
attempted.
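
The idea can be sketched roughly as follows. This is not the kernel code,
only an illustration under the assumption that the check compares the ASPM
Support field (Link Capabilities bits 11:10) of both ends of the link; the
function names are hypothetical, and the commit itself is the authority on
what the patch actually does.

```python
# ASPM Support field in the PCIe Link Capabilities register.
LNKCAP_ASPM_SHIFT = 10
LNKCAP_ASPM_MASK = 0x3 << LNKCAP_ASPM_SHIFT   # bit 10: L0s, bit 11: L1


def aspm_support(lnkcap):
    """Extract the 2-bit ASPM Support field from a Link Capabilities value."""
    return (lnkcap & LNKCAP_ASPM_MASK) >> LNKCAP_ASPM_SHIFT


def aspm_setup_needed(upstream_lnkcap, downstream_lnkcap):
    """True only if both ends advertise at least one common ASPM state.

    When this returns False, the ASPM setup (common clock configuration
    and the retrain it triggers) would be skipped entirely.
    """
    return (aspm_support(upstream_lnkcap) & aspm_support(downstream_lnkcap)) != 0
```

A root port whose Link Capabilities report no ASPM support therefore yields
False whatever the device advertises, and no retrain is requested.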
To me it feels that we need a blacklist of devices that fail link
retraining at a high rate, so that when one is encountered, ASPM is
disabled on the link where it resides.
Just my $0.02
David Daney
Side note: it looks like we don't use the recommended retrain
algorithm in the implementation note about avoiding race conditions in
PCIe r3.0, sec 7.8.7.
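
As I understand that implementation note (paraphrased from memory, so
check the spec text before relying on it), the recommended sequence is:
program the link parameters without setting Retrain Link, poll Link
Training until it reads 0, then set Retrain Link without changing any
other field. A minimal sketch, with hypothetical accessor callbacks
standing in for config-space reads and writes:

```python
import time

LNKCTL_RETRAIN = 1 << 5     # Retrain Link bit in Link Control
LNKSTA_TRAINING = 1 << 11   # Link Training bit in Link Status


def retrain_link(read_lnksta, read_lnkctl, write_lnkctl, new_params,
                 timeout_s=1.0, poll_s=0.001):
    """Retrain with new Link Control parameters; return False on timeout.

    The three callbacks are placeholders for whatever register access
    method is available; new_params is the desired Link Control value
    (its Retrain Link bit, if set, is ignored in step 1).
    """
    # Step 1: program the parameters without setting Retrain Link.
    write_lnkctl(new_params & ~LNKCTL_RETRAIN)

    # Step 2: wait for any training already in progress to finish.
    deadline = time.monotonic() + timeout_s
    while read_lnksta() & LNKSTA_TRAINING:
        if time.monotonic() > deadline:
            return False
        time.sleep(poll_s)

    # Step 3: set Retrain Link, leaving all other fields unchanged.
    write_lnkctl(read_lnkctl() | LNKCTL_RETRAIN)
    return True
```

Compared with the single write of 0x60 tried above, this splits the
operation into a write of 0x40 followed, once training is idle, by a
write of 0x60.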