On Tue, Jul 28, 2020 at 01:24:10PM +0200, Lukas Wunner wrote: > When a PCIe card is hot-removed, the Presence Detect State and Data Link > Layer Link Active bits often do not clear simultaneously. I've seen > delays of up to 244 msec between the two events with Thunderbolt. > > After pciehp has brought down the slot in response to the first event, > the other bit may still be set. It's not discernible whether it's set > because a new card is already in the slot or if it will soon clear. > So pciehp tries to bring up the slot and in the latter case fails with > a bunch of messages, some of them at KERN_ERR severity. The messages > are false positives if the slot is no longer occupied and annoy users. > > Stuart Hayes reports the following splat on hot removal: > > KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Link Up > KERN_INFO pcieport 0000:3c:06.0: pciehp: Timeout waiting for Presence Detect > KERN_ERR pcieport 0000:3c:06.0: pciehp: link training error: status 0x0001 > KERN_ERR pcieport 0000:3c:06.0: pciehp: Failed to check link status > > Dongdong Liu complains about a similar splat: > > KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Link Down > KERN_INFO iommu: Removing device 0000:87:00.0 from group 12 > KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Card present > KERN_INFO pcieport 0000:80:10.0: Data Link Layer Link Active not set in 1000 msec > KERN_ERR pciehp 0000:80:10.0:pcie004: Failed to check link status > > Users are particularly irritated to see a bringup attempt even though > the slot was explicitly brought down via sysfs. In a perfect world, we > could avoid this by setting Link Disable on slot bringdown and > re-enabling it upon a Presence Detect State change. In reality however, > there are broken hotplug ports which hardwire Presence Detect to zero, > see 80696f991424 ("PCI: pciehp: Tolerate Presence Detect hardwired to > zero"). Conversely, PCIe r1.0 hotplug ports hardwire Link Active to > zero because Link Active Reporting wasn't specified before PCIe r1.1. > On unplug, some ports first clear Presence then Link (see Stuart Hayes' > splat) whereas others use the inverse order (see Dongdong Liu's splat). > To top it off, there are hotplug ports which flap the Presence and Link > bits on slot bringup, see 6c35a1ac3da6 ("PCI: pciehp: Tolerate initially > unstable link"). > > pciehp is designed to work with all of these variants. Surplus attempts > at slot bringup are a lesser evil than not being able to bring up slots > at all. Although we could try to perfect the behavior for specific > hotplug controllers, we'd risk breaking others or increasing code > complexity. > > But we can certainly minimize annoyance by emitting only a single > message with KERN_INFO severity if bringup is unsuccessful: > > * Drop the "Timeout waiting for Presence Detect" message in > pcie_wait_for_presence(). The sole caller of that function, > pciehp_check_link_status(), ignores the timeout and carries on. > It emits error messages of its own and I don't think this particular > message adds much value. > > * There's a single error condition in pciehp_check_link_status() which > does not emit a message. Adding one allows dropping the "Failed to > check link status" message emitted by board_added() if > pciehp_check_link_status() returns a non-zero integer. > > * Tone down all messages in pciehp_check_link_status() to KERN_INFO > severity and rephrase them to look as innocuous as possible. To this > end, move the message emitted by pcie_wait_for_link_delay() to its > callers. > > As a result, Stuart Hayes' splat becomes: > > KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Link Up > KERN_INFO pcieport 0000:3c:06.0: pciehp: Slot(180): Cannot train link: status 0x0001 > > Dongdong Liu's splat becomes: > > KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): Card present > KERN_INFO pciehp 0000:80:10.0:pcie004: Slot(36): No link > > The messages now merely serve as information that presence or link bits > were set a little longer than expected. Bringup failures which are not > false positives are still reported, albeit no longer at KERN_ERR > severity. > > Link: https://lore.kernel.org/linux-pci/20200310182100.102987-1-stuart.w.hayes@xxxxxxxxx/ > Link: https://lore.kernel.org/linux-pci/1547649064-19019-1-git-send-email-liudongdong3@xxxxxxxxxx/ > Reported-by: Stuart Hayes <stuart.w.hayes@xxxxxxxxx> > Reported-by: Dongdong Liu <liudongdong3@xxxxxxxxxx> > Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx> Reviewed-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>