Re: CONFIG_PCIEASPM breaks PCIe on Marvell Armada 385 machine

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Wed, 18 Jan 2017 08:22:50 -0600

On Tue, Jan 17, 2017 at 03:37:10PM -0800, David Daney wrote:
> On 01/17/2017 02:22 PM, Bjorn Helgaas wrote:
> >[+cc David]
> >
> >On Tue, Jan 17, 2017 at 09:02:58PM +0000, Russell King - ARM Linux wrote:
> >>On Tue, Jan 17, 2017 at 07:34:14PM +0000, Russell King - ARM Linux wrote:
> >>>Uwe, can you try:
> >>>
> >>>setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
> >>>	0x50.w=0x60
> >>>
> >>>and see whether it remains alive (you can check by reading the root
> >>>register 0x52.w - bit 12 should be set once bit 11 clears again).
> >>
> >>For reference, this I got wrong...
> >>
> >>0xf1041a04 bit 0 indicates link status (0 = link up, 1 = link down).
> >>
> >>>If that's successful, maybe setting the common clock bit on the PCIe
> >>>device is what's causing the problem, in which case:
> >>>
> >>>setpci -s 02:00.0 0x80.w=0x40
> >>>setpci -s <whatever-the-id-of-the-root-is-it's-blanked-out-in-the-above> \
> >>>	0x50.w=0x60
> >>
> >>Having worked with Uwe over IRC, it seems that any request to retrain
> >>causes the link to go down, either with or without the common clock bit
> >>set:
> >>
> >># setpci -s 2.0 0x50.w=0x60
> >># setpci -s 2.0 0x52.w
> >>0011
> >># memtool md 0xf1041a04+4
> >>f1041a04: 00010201
> >>... reboot ...
> >># setpci -s 2.0 0x50.w=0x20
> >># memtool md 0xf1041a04+4
> >>f1041a04: 00010201
> >>
> >>which doesn't point towards ASPM itself, but the problem is caused by
> >>a side effect of ASPM's setup code which always triggers a retrain.
> >>
> >>Bit 5 in that register is documented (at least in the Armada 370 docs
> >>and Armada XP docs I have) as:
> >>
> >>5  RetrnLnk  RW    Retrain Link
> >>             0x0   This bit forces the device to initiate link retraining.
> >>                   Always returns 0 when read.
> >>                   NOTE: If configured as an Endpoint, this field is
> >>                   reserved and has no effect.
> >>
> >>Bjorn, are you aware of similar situations where a request for the PCIe
> >>link to be retrained causes it to fail?
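
[Editorial sketch, not part of the original messages.] The register values in the transcript above can be decoded with a few bit masks. This assumes that config offsets 0x50 and 0x52 on this root port are the standard PCIe Link Control and Link Status registers (consistent with the RetrnLnk bit 5 description quoted above), and it uses the thread's statement that bit 0 of 0xf1041a04 is 0 for link up and 1 for link down. The function names are hypothetical.

```python
# Standard PCIe Link Control bits (assumed to live at config offset 0x50 here).
LNKCTL_RETRAIN_LINK = 1 << 5   # "RetrnLnk" in the Armada docs quoted above
LNKCTL_COMMON_CLOCK = 1 << 6   # Common Clock Configuration


def decode_link_control(value):
    """Return which of the two bits discussed in the thread a write would set."""
    return {
        "retrain_link": bool(value & LNKCTL_RETRAIN_LINK),
        "common_clock": bool(value & LNKCTL_COMMON_CLOCK),
    }


def decode_link_status(value):
    """Split a standard PCIe Link Status value (assumed offset 0x52) into fields."""
    return {
        "speed_code": value & 0xF,              # 1 means 2.5 GT/s
        "width": (value >> 4) & 0x3F,           # negotiated lane count
        "link_training": bool(value & (1 << 11)),
        "slot_clock_config": bool(value & (1 << 12)),
        "dll_link_active": bool(value & (1 << 13)),
    }


def marvell_link_down(value):
    """Per the thread: bit 0 of 0xf1041a04, 0 = link up, 1 = link down."""
    return bool(value & 1)


if __name__ == "__main__":
    print(decode_link_control(0x60))       # retrain with common clock
    print(decode_link_control(0x20))       # retrain only
    print(decode_link_status(0x0011))      # value read back from 0x52.w
    print(marvell_link_down(0x00010201))   # value read at 0xf1041a04
```

Under these assumptions, 0x60 requests a retrain with the common clock bit set, 0x20 requests a retrain alone, and both runs end with bit 0 of 0xf1041a04 set, i.e. link down.
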
> 
> 
> Link (re)training can fail for several reasons including, but not
> limited to:
> 
> - Poor signal propagation through the
> chips/packages/boards/connectors, also known as Signal Integrity
> (SI) problems.
> 
> - Incorrect implementation, in hardware, of link training protocols
> at either end of the link
> 
> Usually, system and PCIe device vendors do a lot of testing and
> signal analysis across a variety of configurations with the end goal
> being that PCIe looks like a bullet-proof interconnect to the end
> consumer.
> 
> Unfortunately, sometimes it doesn't work.  In these cases, the
> vendors of the devices on each end of the link tend to point fingers
> at the link partner for being defective in some way.
> 
> This patch:
> 
> >
> >The only one that comes to mind is this patch from David (CC'd) that
> >avoids ASPM-related retrains when we know the link doesn't support ASPM:
> >http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e53f9a28bee3
> >
> 
> Is an attempt to work around the problem from the system (host) end.
> If the system vendor knows a priori that a defective PCIe device is
> present in the system, the PCIe root port can be configured to
> indicate no ASPM is supported, resulting (with the patch) in no link
> retraining being attempted.
> 
> To me it feels that we need a blacklist of devices that frequently
> fail link retraining; when one is encountered, ASPM would be
> disabled on the link where it resides.

I should have asked you for details about the defective devices
related to e53f9a28bee3 :)  If we had included that in the changelog,
we would have something to seed a blacklist with.

There are several situations other than ASPM where link retraining is
required per spec (rate change, error handling, etc.), and I guess we'd
have to avoid all of them.  So I suppose e53f9a28bee3 avoids the most
obvious failures, but maybe we could still see issues in those other
cases.
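
[Editorial sketch, not part of the original message.] The two ideas discussed above, skipping the ASPM-related retrain when the link cannot support ASPM and additionally consulting a blacklist, could be combined roughly as follows. This is only an illustration under stated assumptions: the ASPM Support field is taken from bits 11:10 of the standard Link Capabilities register, the blacklist entry is a made-up placeholder rather than a known-bad device, and the function names are hypothetical. It is not the logic of e53f9a28bee3 itself.

```python
# ASPM Support field in the standard PCIe Link Capabilities register.
LNKCAP_ASPM_SUPPORT_MASK = 0x00000C00   # bits 11:10
LNKCAP_ASPM_SUPPORT_SHIFT = 10

# Hypothetical (vendor_id, device_id) pairs. Placeholder only: the thread
# does not name any specific defective device.
RETRAIN_BLACKLIST = {(0xFFFF, 0x0001)}


def aspm_support(lnkcap):
    """Extract the 2-bit ASPM Support field (bit 0 = L0s, bit 1 = L1)."""
    return (lnkcap & LNKCAP_ASPM_SUPPORT_MASK) >> LNKCAP_ASPM_SUPPORT_SHIFT


def may_retrain_for_aspm(up_lnkcap, down_lnkcap, down_id):
    """Decide whether ASPM setup should be allowed to retrain the link.

    Retraining is skipped if the downstream device is blacklisted or if
    the two ends share no ASPM state, in which case ASPM cannot be
    enabled and the retrain would serve no purpose.
    """
    if down_id in RETRAIN_BLACKLIST:
        return False
    return (aspm_support(up_lnkcap) & aspm_support(down_lnkcap)) != 0


if __name__ == "__main__":
    # Root port configured to advertise no ASPM: no retrain is attempted.
    print(may_retrain_for_aspm(0x0000, 0x0C00, (0x1234, 0x5678)))
    # Both ends advertise L0s: retrain is allowed.
    print(may_retrain_for_aspm(0x0C00, 0x0400, (0x1234, 0x5678)))
```

A check like this only covers ASPM setup. Retrains that the spec requires for other reasons (rate change, error handling) would not pass through it.
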

Bjorn