On Mon, Jun 17, 2024 at 03:51:57PM -0700, Smita Koralahalli wrote:
> Hi Bjorn,
>
> On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
> > On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
> > > Clear Link Bandwidth Management Status (LBMS) if set, on a
> > > hot-remove event.
> > >
> > > The hot-remove event could result in target link speed reduction if
> > > LBMS is set, due to a delay in Presence Detect State Change (PDSC)
> > > happening after a Data Link Layer State Change event (DLLSC).
> > >
> > > In reality, PDSC and DLLSC events rarely arrive simultaneously. The
> > > PDSC can sometimes be so late that the slot has already been powered
> > > down by the DLLSC event alone, and the delayed PDSC is then falsely
> > > interpreted as an interrupt raised to turn the slot on. This false
> > > attempt to power the slot on, without a link, forces the kernel to
> > > retrain the link to a lower speed if LBMS is set, in order to
> > > re-establish the link, thereby bringing down the link speed [2].
> >
> > Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
> > has no meaning for an empty slot?
>
> I'm trying to also answer your question below here.
>
> > I guess the slot is empty, so retraining is meaningless and will
> > always fail.  Maybe avoiding it avoids a delay?  Is the benefit that
> > we get rid of the message and a delay?
>
> The benefit of this patch is to avoid link speed drops on a hot-remove
> event if LBMS is set and DLLLA is clear. I'm not trying to solve delay
> issues here.
>
> I included the PDSC and DLLSC details as they are the cause of the link
> speed drops on a remove event. On an empty slot, DLLLA is cleared and
> LBMS may or may not be set. And we see cases of link speed drops here,
> if PDSC happens on an empty slot.
>
> We know for a fact that the slot becomes empty if either of the events
> PDSC or DLLSC occurs. Also, neither of them waits for the other to
> bring down the device and mark the slot as "empty". That is the reason
> I was also thinking of waiting on both events, PDSC and DLLSC, to bring
> down the device, as I mentioned in my comments on v1. We could
> eliminate link speed drops by taking that approach as well, but then we
> would have to address cases where PDSC is hardwired to zero.
>
> On our AMD systems, we expect to see both events on a hot-remove event.
> But mostly we see DLLSC happening first, which does the job of marking
> the slot empty. Now, if the PDSC event is delayed too much and occurs
> after the slot becomes empty, the kernel misinterprets PDSC as the
> signal to re-initialize the slot, and this is the sequence of steps the
> kernel takes:
>
> pciehp_handle_presence_or_link_change()
>   pciehp_enable_slot()
>     __pciehp_enable_slot()
>       board_added()
>         pciehp_check_link_status()
>           pcie_wait_for_link()
>             pcie_wait_for_link_delay()
>               pcie_failed_link_retrain()
>
> While doing so, it hits the case of DLLLA clear and LBMS set and brings
> down the speed.

So I guess LBMS currently remains set after a device has been removed,
so the slot is empty, and later when a device is hot-added, *that*
device sees a lower-than-expected link speed?

> The issue of PDSC and DLLSC never occurring simultaneously was a known
> thing from before, and it wasn't breaking anything functionally, as the
> kernel would just exit with the message "No link" from
> pciehp_check_link_status().
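
FWIW, to make sure I follow: the code that bites here is the check that
a89c82249c37 added in pcie_failed_link_retrain() (drivers/pci/quirks.c).
Paraphrasing from memory, so this is a simplified sketch rather than the
literal code:

	u16 lnksta;

	pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);

	/* LBMS set while DLLLA is clear: treat the link as broken */
	if ((lnksta & (PCI_EXP_LNKSTA_LBMS | PCI_EXP_LNKSTA_DLLLA)) ==
	    PCI_EXP_LNKSTA_LBMS) {
		/* Cap the target link speed at 2.5GT/s and retrain */
		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL2,
						   PCI_EXP_LNKCTL2_TLS,
						   PCI_EXP_LNKCTL2_TLS_2_5GT);
		pcie_retrain_link(dev, false);
	}

With a stale LBMS on an empty slot, that condition is trivially true, so
we end up retraining at 2.5GT/s with nothing on the other end of the
link.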
>
> However, commit a89c82249c37 ("PCI: Work around PCIe link training
> failures") introduced link speed downgrades to re-establish links if
> LBMS is set and DLLLA is clear. This caused devices that previously
> operated at higher speeds before hot-remove to operate at 2.5GT/s after
> they were plugged in again.
>
> > > According to PCIe r6.2 sec 7.5.3.8 [1], it is derived that LBMS
> > > cannot be set for an unconnected link, and if set, it serves the
> > > purpose of indicating that there is actually a device down an
> > > inactive link.
> >
> > I see that r6.2 added an implementation note about DLLSC, but I'm not
> > a hardware person and can't follow the implication about a device
> > present down an inactive link.
> >
> > I guess it must be related to the fact that LBMS indicates either
> > completion of link retraining or a change in link speed or width
> > (which would imply presence of a downstream device).  But in both
> > cases I assume the link would be active.
> >
> > But IIUC LBMS is set by hardware but never cleared by hardware, so if
> > we remove a device and power off the slot, it doesn't seem like LBMS
> > could be telling us anything useful (what could we do in response to
> > LBMS when the slot is empty?), so it makes sense to me to clear it.
> >
> > It seems like pciehp_unconfigure_device() does sort of PCI core and
> > driver-related things and possibly could be something shared by all
> > hotplug drivers, while remove_board() does things more specific to
> > the hotplug model (pciehp, shpchp, etc).
> >
> > From that perspective, clearing LBMS might fit better in
> > remove_board().  In that case, I wonder whether it should be done
> > after turning off slot power?  This patch clears it *before* turning
> > off the power, so I wonder if hardware could possibly set it again
> > before the poweroff?
>
> Yeah, by talking to HW people I realized that HW could possibly set
> LBMS at any time while slot power is on. Will change it to clear LBMS
> in remove_board().
>
> > > However, hardware could have already set LBMS when the device was
> > > connected to the port, i.e. when the state was DL_Up or DL_Active.
> > > Some hardware may even have attempted a retrain, going into
> > > recovery mode, just before transitioning to DL_Down.
> > >
> > > Thus the set LBMS is never cleared and might force software to
> > > cause link speed drops when there is no link [2].
> > >
> > > Dmesg before:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
> > > pcieport 0000:20:01.1: retraining failed
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> > >
> > > Dmesg after:
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
> > > pcieport 0000:20:01.1: pciehp: Slot(59): Card present
> > > pcieport 0000:20:01.1: pciehp: Slot(59): No link
> >
> > I'm a little confused about the problem being solved here.  Obviously
> > the message is extraneous.  I guess the slot is empty, so retraining
> > is meaningless and will always fail.  Maybe avoiding it avoids a
> > delay?  Is the benefit that we get rid of the message and a delay?
> >
> > > [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
> > >     https://members.pcisig.com/wg/PCI-SIG/document/20590
> > > [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
> > >
> > > Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")
> >
> > Lukas asked about this; did you confirm that it is related?  Asking
> > because the Fixes tag may cause this to be backported along with
> > a89c82249c37.
>
> Yeah, without this patch we won't see link speed drops.
>
> Thanks,
> Smita
>
> > > Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
> > > ---
> > > Link to v1:
> > > https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@xxxxxxx/
> > >
> > > v2:
> > > 	Cleared LBMS unconditionally. (Ilpo)
> > > 	Added Fixes tag. (Lukas)
> > > ---
> > >  drivers/pci/hotplug/pciehp_pci.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> > > index ad12515a4a12..dae73a8932ef 100644
> > > --- a/drivers/pci/hotplug/pciehp_pci.c
> > > +++ b/drivers/pci/hotplug/pciehp_pci.c
> > > @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
> > >  	}
> > >
> > >  	pci_unlock_rescan_remove();
> > > +
> > > +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
> > > +				   PCI_EXP_LNKSTA_LBMS);
> > >  }
> > > --
> > > 2.17.1
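
FWIW, when you move the clear into remove_board(), I'd imagine something
along these lines (untested sketch against pciehp_ctrl.c, with the
unrelated parts of the function elided):

	static void remove_board(struct controller *ctrl, bool safe_removal)
	{
		pciehp_unconfigure_device(ctrl, safe_removal);

		if (POWER_CTRL(ctrl)) {
			pciehp_power_off_slot(ctrl);

			/* ... existing power-off delay and event masking ... */
		}

		/*
		 * Clear LBMS only after slot power is off, so hardware
		 * cannot set it again for the now-dead link.
		 */
		pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
					   PCI_EXP_LNKSTA_LBMS);

		/* ... existing indicator updates ... */
	}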