Re: [PATCH v2] PCI: pciehp: Clear LBMS on hot-remove to prevent link speed reduction

Hi Bjorn,

On 6/17/2024 1:09 PM, Bjorn Helgaas wrote:
> On Thu, May 16, 2024 at 08:47:48PM +0000, Smita Koralahalli wrote:
>> Clear Link Bandwidth Management Status (LBMS), if set, on a hot-remove
>> event.
>>
>> The hot-remove event could result in target link speed reduction if LBMS
>> is set, due to a delay in Presence Detect State Change (PDSC) happening
>> after a Data Link Layer State Change (DLLSC) event.
>>
>> In reality, PDSC and DLLSC events rarely arrive simultaneously. PDSC can
>> sometimes be so late that the slot has already been powered down by the
>> DLLSC event alone, and the delayed PDSC is then falsely interpreted as an
>> interrupt raised to turn the slot back on. This false power-on, without a
>> link, forces the kernel (if LBMS is set) to retrain the link to a lower
>> speed to re-establish it, thereby bringing down the link speed [2].

> Not sure we need PDSC and DLLSC details to justify clearing LBMS if it
> has no meaning for an empty slot?

I'm also trying to answer your question below here:

> I guess the slot is empty, so retraining
> is meaningless and will always fail.  Maybe avoiding it avoids a
> delay?  Is the benefit that we get rid of the message and a delay?

The benefit of this patch is to avoid link speed drops on a hot-remove event when LBMS is set and DLLLA is clear. I'm not trying to solve the delay issues here.

I included the PDSC and DLLSC details because they are the cause of the link speed drops on a remove event. On an empty slot, DLLLA is clear and LBMS may or may not be set, and we see link speed drops here if PDSC fires on the empty slot.

We know for a fact that the slot becomes empty if either PDSC or DLLSC occurs; neither event waits for the other before bringing down the device and marking the slot "empty". That is why I was also considering waiting on both PDSC and DLLSC before bringing down the device, as I mentioned in my comments on v1. We could eliminate the link speed drops with that approach as well, but we would then have to handle cases where PDSC is hardwired to zero.

On our AMD systems, we expect to see both events on a hot-remove. But mostly we see DLLSC first, which does the job of marking the slot empty. If the PDSC event is then delayed so much that it arrives after the slot has become empty, the kernel misinterprets PDSC as a signal to re-initialize the slot and takes the following sequence of steps:

pciehp_handle_presence_or_link_change()
  pciehp_enable_slot()
    __pciehp_enable_slot()
      board_added()
        pciehp_check_link_status()
          pcie_wait_for_link()
            pcie_wait_for_link_delay()
              pcie_failed_link_retrain()

While doing so, it hits the case of DLLLA clear and LBMS set, and brings down the link speed.

PDSC and DLLSC not arriving simultaneously was known before, and it didn't break anything functionally: the kernel would just exit with the "No link" message from pciehp_check_link_status().

However, commit a89c82249c37 ("PCI: Work around PCIe link training failures") introduced link speed downgrades to re-establish links when LBMS is set and DLLLA is clear. This caused devices that previously operated at higher speeds to run at 2.5 GT/s after being plugged back in following a hot-remove.


>> According to PCIe r6.2 sec 7.5.3.8 [1], LBMS cannot be set for an
>> unconnected link; if set, it serves the purpose of indicating that
>> there is actually a device down an inactive link.

> I see that r6.2 added an implementation note about DLLSC, but I'm not
> a hardware person and can't follow the implication about a device
> present down an inactive link.
>
> I guess it must be related to the fact that LBMS indicates either
> completion of link retraining or a change in link speed or width
> (which would imply presence of a downstream device).  But in both
> cases I assume the link would be active.
>
> But IIUC LBMS is set by hardware but never cleared by hardware, so if
> we remove a device and power off the slot, it doesn't seem like LBMS
> could be telling us anything useful (what could we do in response to
> LBMS when the slot is empty?), so it makes sense to me to clear it.

> It seems like pciehp_unconfigure_device() does sort of PCI core and
> driver-related things and possibly could be something shared by all
> hotplug drivers, while remove_board() does things more specific to the
> hotplug model (pciehp, shpchp, etc).
>
> From that perspective, clearing LBMS might fit better in
> remove_board().  In that case, I wonder whether it should be done
> after turning off slot power?  This patch clears it *before* turning
> off the power, so I wonder if hardware could possibly set it again
> before the poweroff?

Yeah, after talking to HW people I realized that HW could set LBMS at basically any time while slot power is on. I will change the patch to clear it in remove_board().


>> However, hardware could have already set LBMS while the device was
>> connected to the port, i.e. when the state was DL_Up or DL_Active. Some
>> hardware may even have attempted a retrain, going into recovery, just
>> before transitioning to DL_Down.
>>
>> Thus the set LBMS is never cleared and might force software to cause
>> link speed drops when there is no link [2].

>> Dmesg before:
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> 	pcieport 0000:20:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
>> 	pcieport 0000:20:01.1: retraining failed
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link
>>
>> Dmesg after:
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Link Down
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): Card present
>> 	pcieport 0000:20:01.1: pciehp: Slot(59): No link

> I'm a little confused about the problem being solved here.  Obviously
> the message is extraneous.  I guess the slot is empty, so retraining
> is meaningless and will always fail.  Maybe avoiding it avoids a
> delay?  Is the benefit that we get rid of the message and a delay?

>> [1] PCI Express Base Specification Revision 6.2, Jan 25 2024.
>>     https://members.pcisig.com/wg/PCI-SIG/document/20590
>> [2] Commit a89c82249c37 ("PCI: Work around PCIe link training failures")
>>
>> Fixes: a89c82249c37 ("PCI: Work around PCIe link training failures")

> Lukas asked about this; did you confirm that it is related?  Asking
> because the Fixes tag may cause this to be backported along with
> a89c82249c37.

Yes, it is related: without commit a89c82249c37 we don't see the link speed drops.

Thanks,
Smita


>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
>> ---
>> Link to v1:
>> https://lore.kernel.org/all/20240424033339.250385-1-Smita.KoralahalliChannabasappa@xxxxxxx/
>>
>> v2:
>> 	Cleared LBMS unconditionally. (Ilpo)
>> 	Added Fixes tag. (Lukas)
>> ---
>>  drivers/pci/hotplug/pciehp_pci.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
>> index ad12515a4a12..dae73a8932ef 100644
>> --- a/drivers/pci/hotplug/pciehp_pci.c
>> +++ b/drivers/pci/hotplug/pciehp_pci.c
>> @@ -134,4 +134,7 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence)
>>  	}
>>
>>  	pci_unlock_rescan_remove();
>> +
>> +	pcie_capability_write_word(ctrl->pcie->port, PCI_EXP_LNKSTA,
>> +				   PCI_EXP_LNKSTA_LBMS);
>>  }
>> --
>> 2.17.1




