On Sat, Sep 18, 2021 at 6:09 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Thu, Sep 16, 2021 at 11:44:14PM +0800, Kai-Heng Feng wrote:
> > The purpose of the series is to get comments and reviews so we can merge
> > and test the series in downstream kernel.
> >
> > The latest Realtek vendor driver and its Windows driver implement a
> > feature called "dynamic ASPM" which can improve performance on its
> > ethernet NICs.
> >
> > Heiner Kallweit pointed out the potential root cause can be that the
> > buffer is too small for its ASPM exit latency.
>
> I looked at the lspci data in your bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=214307).
>
> L1.2 is enabled, which requires the Latency Tolerance Reporting
> capability, which helps determine when the Link will be put in L1.2.
> IIUC, these are analogous to the DevCap "Acceptable Latency" values.
> Zero latency values indicate the device will be impacted by any delay
> (PCIe r5.0, sec 6.18).
>
> Linux does not currently program those values, so the values there
> must have been set by the BIOS.  On the working AMD system, they're
> set to 1048576ns, while on the broken Intel system, they're set to
> 3145728ns.
>
> I don't really understand how these values should be computed, and I
> think they depend on some electrical characteristics of the Link, so
> I'm not sure it's *necessarily* a problem that they are different.
> But a 3X difference does seem pretty large.
>
> So I'm curious whether this is related to the problem.  Here are some
> things we could try on the broken Intel system:

Original network speed, tested via iperf3:
TX: ~255 Mbps
RX: ~490 Mbps

>
>   - What happens if you disable ASPM L1.2 using
>     /sys/devices/pci*/.../link/l1_2_aspm?

TX: ~670 Mbps
RX: ~670 Mbps

>
>   - If that doesn't work, what happens if you also disable PCI-PM L1.2
>     using /sys/devices/pci*/.../link/l1_2_pcipm?

Same result as disabling only l1_2_aspm.

>
>   - If either of the above makes things work, then at least we know
>     the problem is sensitive to L1.2.

Right now the downstream kernel disables ASPM L1.2 as a workaround.

>
>   - Then what happens if you use setpci to set the LTR Latency
>     registers to 0, then re-enable ASPM L1.2 and PCI-PM L1.2?  This
>     should mean the Realtek device wants the best possible service and
>     the Link probably won't spend much time in L1.2.

# setpci -s 01:00.0 ECAP_LTR+4.w=0x0
# setpci -s 01:00.0 ECAP_LTR+6.w=0x0

After re-enabling ASPM L1.2, the issue persists: the network speed is
still very slow.

>
>   - What happens if you set the LTR Latency registers to 0x1001
>     (should be the same as on the AMD system)?

Same slow speed here.

Kai-Heng
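
For reference, here is a rough sketch of how the LTR latency values
discussed above map to the raw 16-bit register contents written with
setpci.  It assumes the usual layout of the Max Snoop / Max No-Snoop
Latency registers (bits 9:0 = value, bits 12:10 = scale, unit =
32^scale ns); the function and constant names are made up for
illustration, and the layout should be checked against the PCIe spec
before relying on it.

```python
# Sketch only: decode a 16-bit LTR Max (No-)Snoop Latency register.
# Assumed layout: bits 9:0 = latency value, bits 12:10 = latency scale,
# with the unit being 32**scale nanoseconds (scale 0..5).

LTR_VALUE_MASK = 0x3FF   # bits 9:0
LTR_SCALE_SHIFT = 10     # scale field starts at bit 10
LTR_SCALE_MASK = 0x7     # three-bit scale field
LTR_MAX_SCALE = 5        # scales 6 and 7 are treated as invalid here


def ltr_decode_ns(reg: int) -> int:
    """Return the latency in nanoseconds encoded in a raw LTR register."""
    value = reg & LTR_VALUE_MASK
    scale = (reg >> LTR_SCALE_SHIFT) & LTR_SCALE_MASK
    if scale > LTR_MAX_SCALE:
        raise ValueError(f"unsupported LTR scale {scale} in {reg:#06x}")
    return value * (32 ** scale)


if __name__ == "__main__":
    for raw in (0x0000, 0x1001, 0x1003):
        print(f"{raw:#06x} -> {ltr_decode_ns(raw)} ns")
```

With this decoding, 0x1001 is value 1 at scale 4, i.e. 1048576ns, which
matches the AMD system.  The 3145728ns seen on the Intel system would
correspond to, for example, 0x1003 (value 3 at the same scale).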