On Fri, 1 Mar 2024, Maciej W. Rozycki wrote:

> On Fri, 1 Mar 2024, Ilpo Järvinen wrote:
>
> > Besides Link Training bit, pcie_retrain_link() can also be asked to
> > wait for Data Link Layer Link Active bit (PCIe r6.1 sec 7.5.3.8)
> > using 'use_lt' parameter since the merge commit 1abb47390350
> > ("Merge branch 'pci/enumeration'").
>
> Nope, it was added with commit 680e9c47a229 ("PCI: Add support for
> polling DLLLA to pcie_retrain_link()"), before said merge.

Ah sorry, my wording was not good here: I meant the line I was changing
in the patch, and that line didn't exist in 680e9c47a229 at all.

So yes, DLLLA and use_lt waiting was added in 680e9c47a229, but the
merge commit brought the Implementation Note related code into
pcie_retrain_link(), which I think was mismerged when it comes to
use_lt.

> > pcie_retrain_link() first tries to ensure Link Training is not
> > currently active (see Implementation Note in PCIe r6.1 sec 7.5.3.7)
> > which must always check Link Training bit regardless of 'use_lt'.
> > Correct the pcie_wait_for_link_status() parameters to only wait for
> > the correct bit to ensure there is no ongoing Link Training.
>
> You're talking the PCIe spec here and code is talking a bad device
> case.
>
> > Since waiting for Data Link Layer Link Active bit is only used for
> > the Target Speed quirk, this only impacts the case when the quirk
> > attempts to restore speed to higher than 2.5 GT/s (The link is Up
> > at that point so pcie_retrain_link() will fail).
>
> NAK. It's used for both clamping and unclamping and it will break the
> workaround, because the whole point there is to wait until DLLA has
> been set. Using LT is not reliable because it will oscillate in the
> failure case and seeing the bit clear does not mean link has been
> established.

In pcie_retrain_link(), there are two calls into
pcie_wait_for_link_status(), and the second one of them is meant to
implement the link-has-been-established check.

The first wait call comes from e7e39756363a ("PCI/ASPM: Avoid link
retraining race") and is only there to ensure no earlier retraining is
still ongoing, so that the latest configuration is captured as required
by the Implementation Note. LT being cleared is exactly what is wanted
for that check because it means that any earlier retraining has ended
(a new one might be starting, but that doesn't matter; we saw the bit
cleared, so the new configuration should be in effect for that instance
of link retraining). So my point is, the first check is not even meant
to check that the link has been established.

> What are you trying to achieve here and what problem is it to fix?

Actually, I was mistaken about what it breaks, so the problem is not
well described above, but I still think the code is broken:

In the case of the quirk, before 2.5GT/s is attempted, DLLLA is not
set, right? Then the quirk sets the 2.5GT/s target speed and calls
into pcie_retrain_link(). The first call into
pcie_wait_for_link_status() is made with (..., false, true), which
waits until DLLLA is set, but this occurs before the OS has even
triggered Link Retraining. Since there is no retraining commanded by
the OS, DLLLA will not get set, the wait will fail with an error, and
the quirk too returns failure.

It could of course occur that, because of the HW retraining loop
(independent of OS control), the link retrains itself to 2.5GT/s
without the OS asking for it, just by the OS setting the target speed
alone; that is well possible given that the HW behavior in your target
speed quirk case is not fully understood.
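To make the suspected mismerge concrete, here is roughly the shape of
pcie_retrain_link() as I read it in drivers/pci/pci.c, with the change
I'm after (a simplified, untested sketch, not the literal upstream
code; pcie_wait_for_link_status(pdev, use_lt, active) polls either LT
or DLLLA for the given state):

int pcie_retrain_link(struct pci_dev *pdev, bool use_lt)
{
	int rc;

	/*
	 * Implementation Note check (PCIe r6.1 sec 7.5.3.7): ensure no
	 * earlier retraining is still in progress, so always poll LT
	 * until it is cleared. The current (mismerged) code instead
	 * calls pcie_wait_for_link_status(pdev, use_lt, !use_lt) here,
	 * which with use_lt == false waits for DLLLA to become set
	 * before any retraining has even been triggered.
	 */
	rc = pcie_wait_for_link_status(pdev, true, false);
	if (rc)
		return rc;

	pcie_capability_set_word(pdev, PCI_EXP_LNKCTL, PCI_EXP_LNKCTL_RL);
	if (pdev->clear_retrain_link)
		pcie_capability_clear_word(pdev, PCI_EXP_LNKCTL,
					   PCI_EXP_LNKCTL_RL);

	/*
	 * Only this second wait implements the link-has-been-established
	 * check; for the target speed quirk (use_lt == false) it still
	 * polls DLLLA, as the workaround requires.
	 */
	return pcie_wait_for_link_status(pdev, use_lt, !use_lt);
}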
Even if the link does come up that way, it seems not good to rely on
the HW-originated retraining loop to trigger the link retraining that
is necessary here. Maybe this is a far-fetched thought, but perhaps it
could explain why you didn't get the link up with your setup when you
tried to test it earlier.

An alternative approach to fixing this problem would be to not make the
first call into pcie_wait_for_link_status() at all when use_lt = false.

Of course, I cannot test this with your configuration, so I cannot
confirm how the target speed quirk behaves; I just found this by
reading the code. The current code does not make sense, though, because
the first wait is supposed to wait for the LT bit, not for DLLLA.

-- 
 i.