Re: [PATCH] Use maximum latency when determining L1/L0s ASPM v2

Ian Kumlien <ian.kumlien@xxxxxxxxx> · Tue, 29 Sep 2020 23:13:30 +0200

On Tue, Sep 29, 2020 at 6:23 PM Alexander Duyck
<alexander.duyck@xxxxxxxxx> wrote:
>
> On Tue, Sep 29, 2020 at 5:51 AM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> >
> > On Tue, Sep 29, 2020 at 1:31 AM Alexander Duyck
> > <alexander.duyck@xxxxxxxxx> wrote:
> > >
> > > On Mon, Sep 28, 2020 at 1:33 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Sep 28, 2020 at 10:04 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Sep 28, 2020 at 9:53 PM Alexander Duyck
> > > > > <alexander.duyck@xxxxxxxxx> wrote:
> > >
> > > <snip>
> > >
> > > > > > You should be able to manually disable L1 on the realtek link
> > > > > > (4:00.0<->2:04.0) instead of doing it on the upstream link on the
> > > > > > switch. That may provide a datapoint on the L1 behavior of the setup.
> > > > > > Basically if you took the realtek out of the equation in terms of the
> > > > > > L1 exit time you should see the exit time drop to no more than 33us
> > > > > > like what would be expected with just the i210.
> > > > >
> > > > > Yeah, will try it out with echo 0 >
> > > > > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0/0000:04:00.0/link/l1_aspm
> > > > > (which is the device reported by my patch)
> > > >
> > > > So, 04:00.0 is already disabled, the existing code apparently handled
> > > > that correctly... *but*
> > > >
> > > > given the path:
> > > > 00:01.2/01:00.0/02:04.0/04:00.0 Unassigned class [ff00]: Realtek
> > > > Semiconductor Co., Ltd. Device 816e (rev 1a)
> > > >
> > > > Walking backwards:
> > > > -- 04:00.0 has l1 disabled
> > > > -- 02:04.0 doesn't have aspm?!
> > > >
> > > > lspci reports:
> > > > Capabilities: [370 v1] L1 PM Substates
> > > > L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
> > > > L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
> > > > L1SubCtl2:
> > > > Capabilities: [400 v1] Data Link Feature <?>
> > > > Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
> > > > Capabilities: [440 v1] Lane Margining at the Receiver <?>
> > > >
> > > > However the link directory is empty.
> > > >
> > > > Anything we should know about these unknown capabilities? also aspm
> > > > L1.1 and .1.2, heh =)
> > > >
> > > > -- 01:00.0 has L1, disabling it makes the intel nic work again
> > >
> > > I recall that much. However the question is why? If there is already a
> > > 32us time to bring up the link between the NIC and the switch why
> > > would the additional 1us to also bring up the upstream port have that
> > > much of an effect? That is why I am thinking that it may be worthwhile
> > > to try to isolate things further so that only the upstream port and
> > > the NIC have L1 enabled. If we are still seeing issues in that state
> > > then I can only assume there is something off with the
> > > 00:01.2<->1:00.0 link to where it either isn't advertising the actual
> > > L1 recovery time. For example the "Width x4 (downgraded)" looks very
> > > suspicious and could be responsible for something like that if the
> > > link training is having to go through exception cases to work out the
> > > x4 link instead of a x8.
> >
> > It is a x4 link, all links that aren't "fully populated" or "fully
> > utilized" are listed as downgraded...
> >
> > So, x16 card in x8 slot or pcie 3 card in pcie 2 slot - all lists as downgraded
>
> Right, but when both sides say they are capable of x8 and are
> reporting a x4 as is the case in the 00:01.2 <-> 01:00.0 link, that
> raises some eyebrows as both sides say they are capable of x8 so it
> makes me wonder if the lanes were only run for x4 and BIOS/firmware
> wasn't configured correctly, or if only 4 of the lanes are working
> resulting in a x4 due to an electrical issue:

I think there are only 4 physical lanes and afair it has nothing to do with bios

As I stated before, it looks the same for the mellanox card.

> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
> Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
> LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
>
> 01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
> LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
>
> I bring it up because in the past I have seen some NICs that start out
> x4 and after a week with ASPM on and moderate activity end up dropping
> to a x1 and eventually fall off the bus due to electrical issues on
> the motherboard. I recall you mentioning that this has always
> connected at no higher than x4, but I still don't know if that is by
> design or simply because it cannot due to some other issue.

I would have to check with ASUS but I suspect that is as intended

> > > > ASPM L1 enabled:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  5.40 MBytes  45.3 Mbits/sec    0   62.2 KBytes
> > > > [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   70.7 KBytes
> > > > [  5]   2.00-3.00   sec  4.10 MBytes  34.4 Mbits/sec    0   42.4 KBytes
> > > > [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> > > > [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0    105 KBytes
> > > > [  5]   5.00-6.00   sec  4.47 MBytes  37.5 Mbits/sec    0   84.8 KBytes
> > > > [  5]   6.00-7.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> > > > [  5]   7.00-8.00   sec  4.10 MBytes  34.4 Mbits/sec    0   45.2 KBytes
> > > > [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   56.6 KBytes
> > > > [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   48.1 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  44.9 MBytes  37.7 Mbits/sec    0             sender
> > > > [  5]   0.00-10.01  sec  44.0 MBytes  36.9 Mbits/sec                  receiver
> > > >
> > > > ASPM L1 disabled:
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec   111 MBytes   935 Mbits/sec  733    761 KBytes
> > > > [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  733    662 KBytes
> > > > [  5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec  1036   1.20 MBytes
> > > > [  5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec  647    738 KBytes
> > > > [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec  852    744 KBytes
> > > > [  5]   5.00-6.00   sec   109 MBytes   912 Mbits/sec  546    908 KBytes
> > > > [  5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec  303    727 KBytes
> > > > [  5]   7.00-8.00   sec   109 MBytes   912 Mbits/sec  432    769 KBytes
> > > > [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec  462    652 KBytes
> > > > [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec  576    764 KBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  1.07 GBytes   918 Mbits/sec  6320             sender
> > > > [  5]   0.00-10.01  sec  1.06 GBytes   912 Mbits/sec                  receiver
> > > >
> > > > (all measurements are over live internet - so thus variance)
> > >
> > > I forgot there were 5 total devices that were hanging off of there as
> > > well. You might try checking to see if disabling L1 on devices 5:00.0,
> > > 6:00.0 and/or 7:00.0 has any effect while leaving the L1 on 01:00.0
> > > and the NIC active. The basic idea is to go through and make certain
> > > we aren't seeing an L1 issue with one of the other downstream links on
> > > the switch.
> >
> > I did, and i saw no change, only disabling L1 on 01:00.0 gives any effect.
> > But i'd say you're right in your thinking - with L0s head-of-queue
> > stalling can happen
> > due to retry buffers and so on, was interesting to see it detailed...
>
> Okay, so the issue then is definitely the use of L1 on the 00:01.2 <->
> 01:00.0 link. The only piece we don't have the answer to is why, which
> is something we might only be able to answer if we had a PCIe
> analyzer.

Yeah... Maybe these should always have l1 disabled, i have only found
l1.1 and l1.2 errata

> > > The more I think about it the entire setup for this does seem a bit
> > > suspicious. I was looking over the lspci tree and the dump from the
> > > system. From what I can tell the upstream switch link at 01.2 <->
> > > 1:00.0 is only a Gen4 x4 link. However coming off of that is 5
> > > devices, two NICs using either Gen1 or 2 at x1, and then a USB
> > > controller and 2 SATA controller reporting Gen 4 x16. Specifically
> > > those last 3 devices have me a bit curious as they are all reporting
> > > L0s and L1 exit latencies that are the absolute minimum which has me
> > > wondering if they are even reporting actual values.
> >
> > Heh, I have been trying to google for erratas wrt to:
> > 01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch
> > Upstream aka 1022:57ad
> >
> > and the cpu, to see if there is something else I could have missed,
> > but i haven't found anything relating to this yet...
>
> The thing is this could be something that there isn't an errata for.
> All it takes is a bad component somewhere and you can have one lane
> that is a bit flaky and causes the link establishment to take longer
> than it is supposed to.

> The fact that the patch resolves the issue ends up being more
> coincidental than intentional though. We should be able to have the
> NIC work with just the upstream and NIC link on the switch running
> with ASPM enabled, the fact that we can't makes me wonder about that
> upstream port link. Did you only have this one system or were there
> other similar systems you could test with?

I only have this one system...

One thing to bring up was that I did have some issues recreating the
low bandwith state when
testing the l1 settings on the other pcie ids... Ie it worked better for a while

So, It could very well be a race condition with the hardware.

> If we only have the one system it might make sense to just update the
> description for the patch and get away from focusing on this issue,
> and instead focus on the fact that the PCIe spec indicates that this
> is the way it is supposed to be calculated. If we had more of these
> systems to test with and found this is a common thing then we could
> look at adding a PCI quirk for the device to just disable ASPM
> whenever we saw it.

Yeah, agreed, I'll try the ASUS support... but I wonder if I'll get a
good answer