Re: [bugzilla-daemon@xxxxxxxxxx: [Bug 216511] New: Spurious PCI_EXP_SLTSTA_DLLSC when hot plugging]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 21, 2022 at 01:03:26PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 21, 2022 at 06:40:20AM -0500, Bjorn Helgaas wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216511
[...]
> Here's the call chain when handling that DLL state change:
> 
>   pciehp_ist
>     pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &status)
>     status &= ... PCI_EXP_SLTSTA_DLLSC
>     events |= status
>     if (events & PCI_EXP_SLTSTA_DLLSC)
>       pciehp_handle_presence_or_link_change
>         pciehp_disable_slot
>           __pciehp_disable_slot
>             remove_board
>               pciehp_unconfigure_device
>                 pci_stop_and_remove_bus_device
> 
> Per spec, "software must read the Data Link Layer Link Active bit of
> the Link Status Register to determine if the Link is active before
> initiating configuration cycles to the hot plugged device" (PCIe r6.0,
> sec 7.5.3.11).
> 
> It looks like Linux depends on PCI_EXP_SLTSTA_DLLSC but does not
> actually read PCI_EXP_LNKSTA in this path, so this looks like a pciehp
> defect.

I disagree.  The spec citation pertains to *bringup* of the slot,
but this is the bringdown code path.

The logic in pciehp is such that if we receive DLLSC or PDC and the
slot is up, we always bring it down.  Only then do we check whether
the slot is occupied or link is up.  If that's the case, we attempt
to bring the slot up again.

pciehp assumes that the card may have changed when it receives DLLSC
or PDC.  That's the rationale behind this behavior.

In theory one might think that if DLLSC is received without a concurrent
PDC event, then the card in the slot is still the same and only the
link went down (probably flapped).  Unfortunately the reality is not
that simple:  For one, DLLSC and PDC events may come in arbitrary order
and with quite a delay between them.  Second, there are broken slots
which hardwire PDC to 0 and we support those.  So we can't reliably
determine if presence hasn't changed and only link has.

In this particular case, the PEX switch is clearly broken because it
shouldn't signal DLLSC both for a slot where the link change occurred
and its sibling.

A while ago Jon Derrick submitted a patch for a similar problem:
A bifurcated SSD where bringing down one half of the SSD results
in a spurious DLLSC event for the other half:

https://lore.kernel.org/linux-pci/20210830155628.130054-1-jonathan.derrick@xxxxxxxxx/

I'm not really happy with that patch because it adds a quirk
in the middle of the code path for interpreting slot events
which makes it difficult to reason about the code's correctness.

I'm starting to wonder if instead of Jon's patch, we should
just disable DLLSC events on broken devices such as this
PEX switch or Jon's SSD.  We'd only rely on PDC then but
that's probably sufficient.  And the code changes would be
less intrusive.

FWIW, Jon is still interested in upstreaming his quirk:

https://lore.kernel.org/linux-pci/446a21e2-aea2-773f-ca88-b6676b54b292@xxxxxxxxx/

@Richard:  I think Jon's patch doesn't solve your issue does it?
Because I think the issue he's seeing is slightly different
albeit likewise caused by unreliable DLLSC.  (His pertains to
bringdown, yours to bringup of the slot it seems.)

Thanks,

Lukas



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux