On Wed, 7 Aug 2024, Matthew W Carlis wrote: > > For the quirk to trigger, the link has to be down and there has to be the > > LBMS Link Status bit set from link management events as per the PCIe spec > > while the link was previously up, and then both of that while rescanning > > the PCIe device in question, so there's a lot of conditions to meet. > > If there is nothing clearing the bit then why is there any expectation that > it wouldn't be set? We have 20/30/40 endpoints in a host that can be hot-removed, > hot-added at any point in time in any combination & its often the case that > the system uptime be hundreds of days. Eventually the bit will just become set > as a result of time and scale. Well, in principle in a setup with reliable links the LBMS bit may never be set, e.g. this system of mine has been in 24/7 operation since the last reboot 410 days ago and for the devices that support Link Active reporting it shows: LnkSta:Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt- LnkSta:Speed 5GT/s, Width x8, TrErr- Train- SlotClk- DLActive+ BWMgmt+ ABWMgmt+ LnkSta:Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+ LnkSta:Speed 5GT/s, Width x2, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+ LnkSta:Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+ LnkSta:Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+ LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- LnkSta:Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- LnkSta:Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt- so out of 11 devices 6 have the LBMS bit clear. But then 5 have it set, perhaps worryingly, so of course you're right, that it will get set in the field, though it's not enough by itself for your problem to trigger. Then there is manual link retraining, which we do from time to time as well, which will set the bit too, which I overlooked. To cut the story short, it was an oversight of mine, especially as I don't really make any use myself of hot plug scenarios. > > The reason for this is safety: it's better to have a link run at 2.5GT/s > > than not at all, and from the nature of the issue there is no guarantee > > that if you remove the speed clamp, then the link won't go back into the > > infinite retraining loop that the workaround is supposed to break. > > I guess I don't really understand why it wouldn't be safe for every device > other than the ASMedia since they aren't the device that had the issue in the > first place. The main problem in my mind is the system doesn't actually have to > be retraining at all, it can occur the first time you replace a device or > power cycle the device or the first time the device goes into DPC & comes back. > If the quirk simply returned without doing anything when the ASMedia is not in the > allow-list it would make me more at ease. I can also imagine some other implementations > that would work well, but it doesn't seem correct that we could only give a single > opportunity to a device before forcing it to live at Gen1. Perhaps it should be > aware of the rate or a count or something... It's a complex matter. For a starter please read the introduction at `pcie_failed_link_retrain'. When the problem triggers the link goes into an infinite link training loop, with the Link Training (LT) bit flipping on and off. I've made a complementary workaround for U-Boot (since your bootstrap device can be downstream such a broken link), where a statistical probe is done for the LT bit flipping as discovered by polling the bit in a tight loop. This is fine for a piece of firmware that has nothing better to do, but in an OS kernel we just can't do it. Also U-Boot does not remove the 2.5GT/s clamp because it does not have to set up the system in an optimal way, but just sufficiently to successfully boot. An OS kernel can then adjust the configuration if it knows about the workaround, or leave it as it is. In the latter case things will continue working across PCIe resets, because the TLS field is sticky. For Linux I went for just checking the DLLLA bit as it seems good enough for devices that support Link Active reporting. And I admit that the workaround is quite aggressive, but this was a deliberate decision following the robustness principle: the end user may not be qualified enough to diagnose the problem and given its nature not even think it's something that can be made to work. The downstream device does not show up in the PCIe hierarchy and just looks like it's broken. It takes a lot of knowledge and experience to notice the LT bit flipping and even myself, quite a seasoned Linux kernel developer, only noticed it by chance. It was discussed to death with the original submission, the rationale is given in the introductory comment and I'd prefer that it stays as it is. The ASM2824 switch works just fine with other downstream devices and so does the PI7C9X2G304 switch with other upstream devices. Both vendors refused to help narrow it down and declined to comment (a person from Diodes née Pericom replied, but they learnt it's about a PCIe switch they told me they were from another department and couldn't help), which would have helped (I now just recommend staying away from both manufacturers). Therefore my assumption is this can potentially trigger with any pair of devices. I have now posted a patch series to address this problem of yours and a previous issue reported by Ilpo. See: <https://patchwork.kernel.org/project/linux-pci/list/?series=878216>, and I've included you in the list of direct recipients too. I will appreciate if you give it a try, and again, apologies for the trouble caused, but such things happen in engineering. Maciej