On 06.09.2021 17:10, Kai-Heng Feng wrote:
> On Sat, Sep 4, 2021 at 4:00 AM Heiner Kallweit <hkallweit1@xxxxxxxxx> wrote:
>>
>> On 03.09.2021 17:56, Kai-Heng Feng wrote:
>>> On Tue, Aug 31, 2021 at 2:09 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>>>>
>>>> On Sat, Aug 28, 2021 at 01:14:52AM +0800, Kai-Heng Feng wrote:
>>>>> r8169 NICs on some platforms have abysmal speed when ASPM is enabled.
>>>>> The same issue can be observed with older vendor drivers.
>>>>>
>>>>> The issue is, however, solved by the latest vendor driver. There's a new
>>>>> mechanism, which disables r8169's internal ASPM when the NIC traffic has
>>>>> more than 10 packets, and vice versa. The likely reason is that the
>>>>> buffer on the chip is too small for its ASPM exit latency.
>>>>
>>>> This sounds like good speculation, but of course, it would be better
>>>> to have the supporting data.
>>>>
>>>> You say above that this problem affects r8169 on "some platforms." I
>>>> infer that ASPM works fine on other platforms. It would be extremely
>>>> interesting to have some data on both classes, e.g., "lspci -vv"
>>>> output for the entire system.
>>>
>>> lspci data collected from working and non-working systems can be found here:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=214307
>>>
>>>>
>>>> If r8169 ASPM works well on some systems, we *should* be able to make
>>>> it work well on *all* systems, because the device can't tell what
>>>> system it's in. All the device can see are the latencies for entry
>>>> and exit for link states.
>>>
>>> It would definitely be better if we could make r8169 ASPM work on all platforms.
>>>
>>>>
>>>> IIUC this patch makes the driver wake up every 1000ms. If the NIC has
>>>> sent or received more than 10 packets in the last 1000ms, it disables
>>>> ASPM; otherwise it enables ASPM.
>>>
>>> Yes, that's correct.
>>>
>>>>
>>>> I asked these same questions earlier, but nothing changed, so I won't
>>>> raise them again if you don't think they're pertinent. Some patch
>>>> splitting comments below.
>>>
>>> Sorry about that. The lspci data is attached.
>>>
>>
>> Thanks for the additional details. I see that both systems have the L1
>> sub-states active. Do you also face the issue if L1 is enabled but
>> L1.1 and L1.2 are not? Setting the ASPM policy from powersupersave
>> to powersave should be sufficient to disable them.
>> I have a test system Asus PRIME H310I-PLUS, BIOS 2603 10/21/2019 with
>> the same RTL8168h chip version. With L1 active and sub-states inactive
>> everything is fine. With the sub-states activated I get a few missed RX
>> errors when running iperf3.
>
> Once L1.1 and L1.2 are disabled, the TX speed can reach 710 Mbps and RX
> can reach 941 Mbps. So yes, it seems to be the same issue.

I reach 940-950 Mbps in both directions, but this seems to be unrelated
to what we discuss here.

> With dynamic ASPM, TX can reach 750 Mbps while ASPM L1.1 and L1.2 are enabled.
>
>> One difference between your good and bad logs is the following.
>> (My test system shows the same LTR value as your bad system.)
>>
>> Bad:
>> Capabilities: [170 v1] Latency Tolerance Reporting
>>     Max snoop latency: 3145728ns
>>     Max no snoop latency: 3145728ns
>>
>> Good:
>> Capabilities: [170 v1] Latency Tolerance Reporting
>>     Max snoop latency: 1048576ns
>>     Max no snoop latency: 1048576ns
>>
>> I have to admit that I'm not familiar with LTR and don't know whether
>> this difference could contribute to the differing behavior.
>
> I am also unsure what role LTR plays here, so I tried changing the
> LTR value to 1048576ns, and it yielded the same result: TX and RX remain
> very slow.
>
> Kai-Heng
>