Re: [PATCH] Use maximum latency when determining L1/L0s ASPM v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 23, 2020 at 11:23 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Wed, Sep 23, 2020 at 01:29:18AM +0200, Ian Kumlien wrote:
> > On Wed, Sep 23, 2020 at 1:00 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Tue, Sep 22, 2020 at 11:02:35PM +0200, Ian Kumlien wrote:
>
> > > > commit db3d9c4baf4ab177d87b5cd41f624f5901e7390f
> > > > Author: Ian Kumlien <ian.kumlien@xxxxxxxxx>
> > > > Date:   Sun Jul 26 16:01:15 2020 +0200
> > > >
> > > >     Use maximum latency when determining L1 ASPM
> > > >
> > > >     If it's not, we clear the link for the path that had too large latency.
> > > >
> > > >     Currently we check the maximum latency of upstream and downstream
> > > >     per link, not the maximum for the path
> > > >
> > > >     This would work if all links have the same latency, but:
> > > >     endpoint -> c -> b -> a -> root  (in the order we walk the path)
> > > >
> > > >     If c or b has the higest latency, it will not register
> > > >
> > > >     Fix this by maintaining the maximum latency value for the path
> > > >
> > > >     See this bugzilla for more information:
> > > >     https://bugzilla.kernel.org/show_bug.cgi?id=208741
> > > >
> > > >     This fixes an issue for me where my desktops machines maximum bandwidth
> > > >     for remote connections dropped from 933 MBit to ~40 MBit.
> > > >
> > > >     The bug became obvious once we enabled ASPM on all links:
> > > >     66ff14e59e8a (PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges)
> > >
> > > I can't connect the dots here yet.  I don't see a PCIe-to-PCI/PCI-X
> > > bridge in your lspci, so I can't figure out why this commit would make
> > > a difference for you.
> > >
> > > IIUC, the problem device is 03:00.0, the Intel I211 NIC.  Here's the
> > > path to it:
> > >
> > >   00:01.2 Root Port              to [bus 01-07]
> > >   01:00.0 Switch Upstream Port   to [bus 02-07]
> > >   02:03.0 Switch Downstream Port to [bus 03]
> > >   03:00.0 Endpoint (Intel I211 NIC)
> > >
> > > And I think this is the relevant info:
> > >
> > >                                                     LnkCtl    LnkCtl
> > >            ------DevCap-------  ----LnkCap-------  -Before-  -After--
> > >   00:01.2                                L1 <32us       L1+       L1-
> > >   01:00.0                                L1 <32us       L1+       L1-
> > >   02:03.0                                L1 <32us       L1+       L1+
> > >   03:00.0  L0s <512ns L1 <64us  L0s <2us L1 <16us  L0s- L1-  L0s- L1-
> > >
> > > The NIC says it can tolerate at most 512ns of L0s exit latency and at
> > > most 64us of L1 exit latency.
> > >
> > > 02:03.0 doesn't support L0s, and the NIC itself can't exit L0s that
> > > fast anyway (it can only do <2us), so L0s should be out of the picture
> > > completely.
> > >
> > > Before your patch, apparently we (or BIOS) enabled L1 on the link from
> > > 00:01.2 to 01:00.0, and partially enabled it on the link from 02:03.0
> > > to 03:00.0.
> >
> > According to the spec, this is managed by the OS - which was the
> > change introduced...
>
> BIOS frequently enables ASPM and, if CONFIG_PCIEASPM_DEFAULT=y, I
> don't think Linux touches it unless a user requests it via sysfs.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=66ff14e59e8a

"7d715a6c1ae5 ("PCI: add PCI Express ASPM support") added the ability for
Linux to enable ASPM, but for some undocumented reason, it didn't enable
ASPM on links where the downstream component is a PCIe-to-PCI/PCI-X Bridge.

Remove this exclusion so we can enable ASPM on these links."

> What does "grep ASPM .config" tell you?

CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_POWERSAVE=y

And all of this worked before the commit above.

> Boot with "pci=earlydump" and the dmesg will tell us what the BIOS
> did.
>
> If you do this in the unmodified kernel:
>
>   # echo 1 > /sys/.../0000:03:00.0/l1_aspm
>
> it should enable L1 for 03:00.0.  I'd like to know whether it actually
> does, and whether the NIC behaves any differently when L1 is enabled
> for the entire path instead of just the first three components.

With an unmodified kernel, I have ~40 mbit bandwidth.

> If the above doesn't work, you should be able to enable ASPM manually:
>
>   # setpci -s03:00.0 CAP_EXP+0x10.w=0x0042

I used something similar to disable ASPM, which made the nic work again.

> > > It looks like we *should* be able to enable L1 on both links since the
> > > exit latency should be <33us (first link starts exit at T=0, completes
> > > by T=32; second link starts exit at T=1, completes by T=33), and
> > > 03:00.0 can tolerate up to 64us.
> > >
> > > I guess the effect of your patch is to disable L1 on the 00:01.2 -
> > > 01:00.0 link?  And that makes the NIC work better?  I am obviously
> > > missing something because I don't understand why the patch does that
> > > or why it works better.
> >
> > It makes it work like normal again, like if i disable ASPM on the
> > nic itself...
>
> I wonder if ASPM is just broken on this device.
> __e1000e_disable_aspm() mentions hardware errata on a different Intel
> NIC.

I doubt it. The total time seems to surpass the time it can handle
with it's buffers.

Note that the nic is running with ASPM now, and it is working =)

> > I don't know which value that reflects, up or down - since we do max
> > of both values and
> > it actually disables ASPM.
> >
> > What we can see is that the first device that passes the threshold
> > is 01:00.0
>
> I don't understand this.  03:00.0 claims to be able to tolerate 64us
> of L1 exit latency.  The path should only have <33us of latency, so
> it's not even close to the 64us threshold.

Well, this is why i dumped the raw data - i don't know how lspci reads it

> > How can I read more data from PCIe without needing to add kprint...
> >
> > This is what lspci uses apparently:
> > #define  PCI_EXP_LNKCAP_L0S     0x07000 /* L0s Exit Latency */
> > #define  PCI_EXP_LNKCAP_L1      0x38000 /* L1 Exit Latency */
> >
> > But which latencies are those? up or down?
>
> I think the idea in aspm.c that exit latencies depend on which
> direction traffic is going is incorrect.  The components at the
> upstream and downstream ends of the link may have different exit
> latencies, of course.

Yes, and the max latency is the maximum time the device can buffer before
the link has to be up, so maximum time to establish link must be
max(up, down) right?



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux