On Mon, Sep 28, 2020 at 1:33 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
>
> On Mon, Sep 28, 2020 at 10:04 PM Ian Kumlien <ian.kumlien@xxxxxxxxx> wrote:
> >
> > On Mon, Sep 28, 2020 at 9:53 PM Alexander Duyck
> > <alexander.duyck@xxxxxxxxx> wrote:

<snip>

> > > You should be able to manually disable L1 on the Realtek link
> > > (04:00.0 <-> 02:04.0) instead of doing it on the upstream link on
> > > the switch. That may provide a data point on the L1 behavior of
> > > the setup. Basically, if you took the Realtek out of the equation
> > > in terms of the L1 exit time, you should see the exit time drop to
> > > no more than 33us, as would be expected with just the i210.
> >
> > Yeah, will try it out with echo 0 >
> > /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:04.0/0000:04:00.0/link/l1_aspm
> > (which is the device reported by my patch)
>
> So, 04:00.0 is already disabled, the existing code apparently handled
> that correctly... *but*
>
> given the path:
> 00:01.2/01:00.0/02:04.0/04:00.0 Unassigned class [ff00]: Realtek
> Semiconductor Co., Ltd. Device 816e (rev 1a)
>
> Walking backwards:
> -- 04:00.0 has L1 disabled
> -- 02:04.0 doesn't have ASPM?!
>
> lspci reports:
> Capabilities: [370 v1] L1 PM Substates
>         L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
>         L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
>         L1SubCtl2:
> Capabilities: [400 v1] Data Link Feature <?>
> Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
> Capabilities: [440 v1] Lane Margining at the Receiver <?>
>
> However, the link directory is empty.
>
> Anything we should know about these unknown capabilities? Also ASPM
> L1.1 and L1.2, heh =)
>
> -- 01:00.0 has L1; disabling it makes the Intel NIC work again

I recall that much. However, the question is why. If there is already a
32us time to bring up the link between the NIC and the switch, why would
the additional 1us to also bring up the upstream port have that much of
an effect? That is why I am thinking it may be worthwhile to isolate
things further, so that only the upstream port and the NIC have L1
enabled. If we are still seeing issues in that state, then I can only
assume there is something off with the 00:01.2 <-> 01:00.0 link, such
that it isn't advertising its actual L1 recovery time. For example, the
"Width x4 (downgraded)" looks very suspicious, and could be responsible
for something like that if link training has to go through exception
cases to negotiate the x4 link instead of a x8.
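For what it's worth, walking that path by hand gets tedious. A quick
Python sketch along the following lines (an illustration only: it
assumes a kernel that exposes the per-device link/l1_aspm attribute
used in the echo above, plus the standard current_link_width,
max_link_width and current_link_speed sysfs files) can dump the L1
state and the negotiated vs. maximum width at every hop, which would
make both the downgraded link and the empty link directory stand out
at a glance:

#!/usr/bin/env python3
# Walk a PCIe device's upstream path in sysfs and print the L1 ASPM
# state plus negotiated/maximum link width and current speed for each
# function on the way up.
import os

# The Realtek device from this thread, used here as the example leaf.
DEV = ("/sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/"
       "0000:02:04.0/0000:04:00.0")

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"  # e.g. the empty link directory seen on 02:04.0

node = DEV
while os.path.basename(node).startswith("0000:"):
    print("%s: l1_aspm=%s width=%s/%s speed=%s" % (
        os.path.basename(node),
        read(os.path.join(node, "link", "l1_aspm")),
        read(os.path.join(node, "current_link_width")),
        read(os.path.join(node, "max_link_width")),
        read(os.path.join(node, "current_link_speed"))))
    node = os.path.dirname(node)

A hop printing something like "width=4/8" there would be the downgraded
link in question.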
> ASPM L1 enabled:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  5.40 MBytes  45.3 Mbits/sec    0   62.2 KBytes
> [  5]   1.00-2.00   sec  4.47 MBytes  37.5 Mbits/sec    0   70.7 KBytes
> [  5]   2.00-3.00   sec  4.10 MBytes  34.4 Mbits/sec    0   42.4 KBytes
> [  5]   3.00-4.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> [  5]   4.00-5.00   sec  4.47 MBytes  37.5 Mbits/sec    0    105 KBytes
> [  5]   5.00-6.00   sec  4.47 MBytes  37.5 Mbits/sec    0   84.8 KBytes
> [  5]   6.00-7.00   sec  4.47 MBytes  37.5 Mbits/sec    0   65.0 KBytes
> [  5]   7.00-8.00   sec  4.10 MBytes  34.4 Mbits/sec    0   45.2 KBytes
> [  5]   8.00-9.00   sec  4.47 MBytes  37.5 Mbits/sec    0   56.6 KBytes
> [  5]   9.00-10.00  sec  4.47 MBytes  37.5 Mbits/sec    0   48.1 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  44.9 MBytes  37.7 Mbits/sec    0          sender
> [  5]   0.00-10.01  sec  44.0 MBytes  36.9 Mbits/sec               receiver
>
> ASPM L1 disabled:
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   111 MBytes   935 Mbits/sec  733    761 KBytes
> [  5]   1.00-2.00   sec   110 MBytes   923 Mbits/sec  733    662 KBytes
> [  5]   2.00-3.00   sec   109 MBytes   912 Mbits/sec 1036   1.20 MBytes
> [  5]   3.00-4.00   sec   109 MBytes   912 Mbits/sec  647    738 KBytes
> [  5]   4.00-5.00   sec   110 MBytes   923 Mbits/sec  852    744 KBytes
> [  5]   5.00-6.00   sec   109 MBytes   912 Mbits/sec  546    908 KBytes
> [  5]   6.00-7.00   sec   109 MBytes   912 Mbits/sec  303    727 KBytes
> [  5]   7.00-8.00   sec   109 MBytes   912 Mbits/sec  432    769 KBytes
> [  5]   8.00-9.00   sec   110 MBytes   923 Mbits/sec  462    652 KBytes
> [  5]   9.00-10.00  sec   109 MBytes   912 Mbits/sec  576    764 KBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  1.07 GBytes   918 Mbits/sec  6320        sender
> [  5]   0.00-10.01  sec  1.06 GBytes   912 Mbits/sec              receiver
>
> (all measurements are over the live internet - hence the variance)

I forgot there were 5 devices in total hanging off of there as well.
You might try checking whether disabling L1 on devices 5:00.0, 6:00.0
and/or 7:00.0 has any effect while leaving L1 on 01:00.0 and the NIC
active. The basic idea is to go through and make certain we aren't
seeing an L1 issue with one of the other downstream links on the
switch; see the sketch below.

The more I think about it, the more the entire setup seems a bit
suspicious. I was looking over the lspci tree and the dump from the
system. From what I can tell, the upstream switch link at
00:01.2 <-> 01:00.0 is only a Gen4 x4 link, yet coming off of it are 5
devices: two NICs running at Gen1 or Gen2 x1, and then a USB controller
and two SATA controllers reporting Gen4 x16. Those last 3 devices in
particular have me a bit curious, as they are all reporting L0s and L1
exit latencies at the absolute minimum, which has me wondering whether
they are even reporting actual values.
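To make that experiment quick to repeat, a minimal sketch in the same
vein (again assuming the link/l1_aspm sysfs attribute from earlier in
the thread; the BDFs are the three devices named above) could flip
them off in one go before each iperf3 run:

#!/usr/bin/env python3
# Disable L1 ASPM on the other devices behind the switch, leaving
# 01:00.0 and the NIC untouched. Run as root; write "1" to re-enable.
SIBLINGS = ["0000:05:00.0", "0000:06:00.0", "0000:07:00.0"]

for bdf in SIBLINGS:
    attr = "/sys/bus/pci/devices/%s/link/l1_aspm" % bdf
    try:
        with open(attr, "w") as f:
            f.write("0")  # 0 = L1 disabled on this device's link
        print("%s: L1 disabled" % bdf)
    except OSError as e:
        print("%s: %s" % (bdf, e))  # missing attribute or not root

If the throughput only recovers once a particular sibling has L1 turned
off, that would point at that device's downstream link rather than at
the upstream x4 link.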