On Fri, Sep 25, 2020 at 5:49 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> [+cc linux-pci, others again]
>
> On Fri, Sep 25, 2020 at 03:54:11PM +0200, Ian Kumlien wrote:
> > On Fri, Sep 25, 2020 at 3:39 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Sep 25, 2020 at 12:18:50PM +0200, Ian Kumlien wrote:
> > > > So.......
> > > > [ 0.815843] pci 0000:04:00.0: L1 latency exceeded - path: 1000 - max: 64000
> > > > [ 0.815843] pci 0000:00:01.2: Upstream device - 32000
> > > > [ 0.815844] pci 0000:01:00.0: Downstream device - 32000
> > >
> > > Wait a minute. I've been looking at *03:00.0*, not 04:00.0. Based
> > > on your bugzilla, here's the path:
> >
> > Correct, or you could do it like this:
> > 00:01.2/01:00.0/02:03.0/03:00.0 Ethernet controller: Intel Corporation
> > I211 Gigabit Network Connection (rev 03)
> >
> > > 00:01.2 Root Port to [bus 01-07]
> > > 01:00.0 Switch Upstream Port to [bus 02-07]
> > > 02:03.0 Switch Downstream Port to [bus 03]
> > > 03:00.0 Endpoint (Intel I211 NIC)
> > >
> > > Your system does also have an 04:00.0 here:
> > >
> > > 00:01.2 Root Port to [bus 01-07]
> > > 01:00.0 Switch Upstream Port to [bus 02-07]
> > > 02:04.0 Switch Downstream Port to [bus 04]
> > > 04:00.0 Endpoint (Realtek 816e)
> > > 04:00.1 Endpoint (Realtek RTL8111/8168/8411 NIC)
> > > 04:00.2 Endpoint (Realtek 816a)
> > > 04:00.4 Endpoint (Realtek 816d EHCI USB)
> > > 04:00.7 Endpoint (Realtek 816c IPMI)
> > >
> > > Which NIC is the problem?
> >
> > The intel one - so the side effect of the realtek nic is that it fixes
> > the intel nics issues.
> >
> > It would be that the intel nic actually has a bug with L1 (and it
> > would seem that it's to kind with latencies) so it actually has a
> > smaller buffer...
> >
> > And afair, the realtek has a larger buffer, since it behaves better
> > with L1 enabled.
> >
> > Either way, it's a fix that's needed ;)
>
> OK, what specifically is the fix that's needed?
> The L0s change seems like a "this looks wrong" thing that doesn't
> actually affect your situation, so let's skip that for now.

The L1 latency calculation is not good enough: it assumes that *any*
link is the highest-latency link, which is incorrect.

The latency to bring L1 up is:
  number-of-hops*1000 + maximum-latency-along-the-path

Currently it's only doing:
  number-of-hops*1000 + arbitrary-latency-of-current-link

> And let's support the change you *do* need with the "lspci -vv" for
> all the relevant devices (including both 03:00.0 and 04:00.x, I guess,
> since they share the 00:01.2 - 01:00.0 link), before and after the
> change.

They are all included in all of the lspci output in the bug.

> I want to identify something in the "before" configuration that is
> wrong according to the spec, and a change in the "after" config so it
> now conforms to the spec.

So there are a few issues here. The current code does not conform to
the spec.

The Intel NIC gets fixed as a side effect of making the code conform
to the spec (it should still get a proper fix).

Basically, while hunting for the issue, I found that the L1 and L0s
latency calculations used to determine whether they should be enabled
are wrong. That is what I'm currently trying to push. It also seems
like the Intel NIC claims that it can handle 64us but apparently it
can't.

So, three bugs: two are "fixed", one needs additional fixing.

> Bjorn
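
To make the difference between the two L1 formulas above concrete, here
is a minimal sketch in Python. It is not the kernel code: the function
names, the two-link path, and the latency values are all hypothetical,
and it assumes latencies are in ns and that "hops" counts the links
above the endpoint's own link.

```python
# Illustrative sketch only; this is not the kernel's ASPM code.
# Assumptions: latencies are in nanoseconds, links are ordered from the
# endpoint's own link up towards the root port, and each link above the
# first one adds a fixed 1000 to the total.

SWITCH_DELAY_NS = 1000  # the "number-of-hops*1000" term


def checked_latencies(link_latencies_ns, use_path_max):
    """Return the value compared against the endpoint's limit at each link.

    use_path_max=False: hops*1000 + latency of the current link only.
    use_path_max=True:  hops*1000 + maximum latency seen along the path.
    """
    checked = []
    path_max = 0
    for hops, link_latency in enumerate(link_latencies_ns):
        path_max = max(path_max, link_latency)
        base = path_max if use_path_max else link_latency
        checked.append(hops * SWITCH_DELAY_NS + base)
    return checked


def l1_allowed(link_latencies_ns, acceptable_ns, use_path_max):
    """True if no link on the path exceeds the endpoint's acceptable latency."""
    values = checked_latencies(link_latencies_ns, use_path_max)
    return all(value <= acceptable_ns for value in values)


# Hypothetical path: a slow link next to the endpoint, a faster one above it.
links = [32000, 8000]
acceptable = 32000

print(checked_latencies(links, use_path_max=False))
print(checked_latencies(links, use_path_max=True))
print(l1_allowed(links, acceptable, use_path_max=False))
print(l1_allowed(links, acceptable, use_path_max=True))
```

With these made-up numbers the per-link variant never notices that the
slow link further down the path already used up the whole budget, while
the path-maximum variant flags the upper link as exceeding the limit.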