On Tue, Sep 29, 2020 at 5:51 AM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
>
> On Tue, Sep 29, 2020 at 1:31 AM Alexander Duyck
> <alexander.duyck@xxxxxxxxx> wrote:
> >
> > On Mon, Sep 28, 2020 at 1:33 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> > >
> > > On Mon, Sep 28, 2020 at 10:04 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Sep 28, 2020 at 9:53 PM Alexander Duyck
> > > > <alexander.duyck@xxxxxxxxx> wrote:
> > > > <snip>
> > > >
> > > > > You should be able to manually disable L1 on the realtek link
> > > > > (4:00.0<->2:04.0) instead of doing it on the upstream link on the
> > > > > switch. That may provide a datapoint on the L1 behavior of the setup.
> > > > > Basically if you took the realtek out of the equation in terms of the
> > > > > L1 exit time you should see the exit time drop to no more than 33us,
> > > > > like what would be expected with just the i210.
> > > >
> > > > Yeah, will try it out with echo 0 >
> > > > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0/0000:04:00.0/link/l1_aspm
> > > > (which is the device reported by my patch)
> > >
> > > So, 04:00.0 is already disabled, the existing code apparently handled
> > > that correctly... *but*
> > >
> > > given the path:
> > > 00:01.2/01:00.0/02:04.0/04:00.0 Unassigned class [ff00]: Realtek
> > > Semiconductor Co., Ltd. Device 816e (rev 1a)
> > >
> > > Walking backwards:
> > > -- 04:00.0 has l1 disabled
> > > -- 02:04.0 doesn't have aspm?!
> > >
> > > lspci reports:
> > > Capabilities: [370 v1] L1 PM Substates
> > >         L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
> > >         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> > >         L1SubCtl2:
> > > Capabilities: [400 v1] Data Link Feature <?>
> > > Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
> > > Capabilities: [440 v1] Lane Margining at the Receiver <?>
> > >
> > > However the link directory is empty.
> > >
> > > Anything we should know about these unknown capabilities? also aspm
> > > L1.1 and L1.2, heh =)
> > >
> > > -- 01:00.0 has L1, disabling it makes the intel nic work again
> >
> > I recall that much. However the question is why? If there is already a
> > 32us time to bring up the link between the NIC and the switch, why
> > would the additional 1us to also bring up the upstream port have that
> > much of an effect? That is why I am thinking that it may be worthwhile
> > to try to isolate things further so that only the upstream port and
> > the NIC have L1 enabled. If we are still seeing issues in that state
> > then I can only assume there is something off with the
> > 00:01.2<->1:00.0 link to where it isn't advertising the actual
> > L1 recovery time. For example the "Width x4 (downgraded)" looks very
> > suspicious and could be responsible for something like that if the
> > link training is having to go through exception cases to work out the
> > x4 link instead of a x8.
>
> It is a x4 link, all links that aren't "fully populated" or "fully
> utilized" are listed as downgraded...
>
> So, x16 card in x8 slot or pcie 3 card in pcie 2 slot - all lists as downgraded

Right, but when both sides claim to be capable of x8 yet report a x4, as is
the case on the 00:01.2 <-> 01:00.0 link, that raises some eyebrows. It makes
me wonder whether the lanes were only run for x4 and the BIOS/firmware wasn't
configured correctly, or whether only four of the lanes are working, resulting
in a x4 due to an electrical issue:

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
        LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
        LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)

01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
        LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
        LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)

I bring it up because in the past I have seen NICs that start out at x4 and,
after a week with ASPM on and moderate activity, end up dropping to a x1 and
eventually falling off the bus due to electrical issues on the motherboard.
I recall you mentioning that this has always connected at no higher than x4,
but I still don't know if that is by design or simply because it cannot run
any wider due to some other issue.
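For reference, a quick way to keep an eye on that kind of degradation is to
compare the advertised and negotiated link state of the two ports involved.
This is just an illustrative shell snippet using the addresses from this
thread, not something from the patch:

    # Compare LnkCap (advertised) with LnkSta (negotiated) on the suspect
    # link, then read the same information back from sysfs.
    for dev in 0000:00:01.2 0000:01:00.0; do
        echo "== $dev =="
        sudo lspci -s "$dev" -vvv | grep -E 'LnkCap:|LnkSta:'
        cat /sys/bus/pci/devices/$dev/current_link_width \
            /sys/bus/pci/devices/$dev/current_link_speed
    done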
> > > ASPM L1 enabled:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [ 5]   0.00-1.00   sec  5.40 MBytes  45.3 Mbits/sec    0   62.2 KBytes
> > > [ 5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   70.7 KBytes
> > > [ 5]   2.00-3.00   sec  4.10 MBytes  34.4 Mbits/sec    0   42.4 KBytes
> > > [ 5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> > > [ 5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0    105 KBytes
> > > [ 5]   5.00-6.00   sec  4.47 MBytes  37.5 Mbits/sec    0   84.8 KBytes
> > > [ 5]   6.00-7.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> > > [ 5]   7.00-8.00   sec  4.10 MBytes  34.4 Mbits/sec    0   45.2 KBytes
> > > [ 5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   56.6 KBytes
> > > [ 5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   48.1 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [ 5]   0.00-10.00  sec  44.9 MBytes  37.7 Mbits/sec    0             sender
> > > [ 5]   0.00-10.01  sec  44.0 MBytes  36.9 Mbits/sec                  receiver
> > >
> > > ASPM L1 disabled:
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [ 5]   0.00-1.00   sec   111 MBytes   935 Mbits/sec  733    761 KBytes
> > > [ 5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  733    662 KBytes
> > > [ 5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec 1036   1.20 MBytes
> > > [ 5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec  647    738 KBytes
> > > [ 5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec  852    744 KBytes
> > > [ 5]   5.00-6.00   sec   109 MBytes   912 Mbits/sec  546    908 KBytes
> > > [ 5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec  303    727 KBytes
> > > [ 5]   7.00-8.00   sec   109 MBytes   912 Mbits/sec  432    769 KBytes
> > > [ 5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec  462    652 KBytes
> > > [ 5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec  576    764 KBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [ 5]   0.00-10.00  sec  1.07 GBytes   918 Mbits/sec 6320             sender
> > > [ 5]   0.00-10.01  sec  1.06 GBytes   912 Mbits/sec                  receiver
> > >
> > > (all measurements are over live internet - so thus variance)
> >
> > I forgot there were 5 total devices that were hanging off of there as
> > well. You might try checking to see if disabling L1 on devices 5:00.0,
> > 6:00.0 and/or 7:00.0 has any effect while leaving the L1 on 01:00.0
> > and the NIC active. The basic idea is to go through and make certain
> > we aren't seeing an L1 issue with one of the other downstream links on
> > the switch.
>
> I did, and i saw no change, only disabling L1 on 01:00.0 gives any effect.
> But i'd say you're right in your thinking - with L0s head-of-queue
> stalling can happen
> due to retry buffers and so on, was interesting to see it detailed...

Okay, so the issue then is definitely the use of L1 on the 00:01.2 <->
01:00.0 link. The only piece we don't have the answer to is why, which is
something we might only be able to answer if we had a PCIe analyzer.
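For completeness, the manual workaround that falls out of this is the one
already used above: turn off L1 on just that link and leave ASPM alone
everywhere else. As a sketch (shell, using the sysfs layout quoted earlier
in the thread, and assuming the link/ directory for 01:00.0 is populated on
this kernel):

    # Disable ASPM L1 on the 00:01.2 <-> 01:00.0 link only; devices behind
    # the switch keep whatever ASPM state they already negotiated.
    echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/link/l1_aspm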
> > The more I think about it the entire setup for this does seem a bit
> > suspicious. I was looking over the lspci tree and the dump from the
> > system. From what I can tell the upstream switch link at 01.2 <->
> > 1:00.0 is only a Gen4 x4 link. However coming off of that is 5
> > devices, two NICs using either Gen1 or 2 at x1, and then a USB
> > controller and 2 SATA controllers reporting Gen 4 x16. Specifically
> > those last 3 devices have me a bit curious as they are all reporting
> > L0s and L1 exit latencies that are the absolute minimum which has me
> > wondering if they are even reporting actual values.
>
> Heh, I have been trying to google for erratas wrt to:
> 01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
> Upstream aka 1022:57ad
>
> and the cpu, to see if there is something else I could have missed,
> but i haven't found anything relating to this yet...

The thing is, this could be something there isn't an errata for. All it
takes is a bad component somewhere: one flaky lane can cause link
establishment to take longer than it is supposed to. The fact that the
patch resolves the issue ends up being more coincidental than intentional,
though. We should be able to have the NIC work with only the upstream link
and the NIC's link on the switch running with ASPM enabled; the fact that
we can't makes me wonder about that upstream port link.

Do you have only this one system, or are there other similar systems you
could test with? If we only have the one system, it might make sense to
update the patch description to move away from this specific issue and
instead focus on the fact that the PCIe spec indicates that this is how
the latency is supposed to be calculated. If we had more of these systems
to test with and found this to be a common problem, then we could look at
adding a PCI quirk for the device to just disable ASPM whenever we see it.
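To make that last point concrete, the calculation the spec describes is
simple enough to sanity-check by hand. Below is an illustrative shell
version (not the patch code) using the numbers from this thread: roughly a
32us advertised L1 exit latency on the NIC's link and on the 00:01.2 <->
01:00.0 link, with one switch in between:

    # Worst-case L1 exit latency for the path: the largest per-link exit
    # latency, plus 1us for every switch the wakeup has to propagate through.
    link_exit_us="32 32"   # advertised L1 exit latencies along the path
    switches=1             # intermediate switches between root port and NIC
    max=0
    for l in $link_exit_us; do
        [ "$l" -gt "$max" ] && max="$l"
    done
    echo "worst-case L1 exit latency: $((max + switches))us"   # -> 33us

That matches the "no more than 33us" estimate for the i210 path quoted at
the top of this mail.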