On 7/20/2021 1:57 AM, Lukas Wunner wrote:
On Mon, Jul 19, 2021 at 02:00:51PM -0500, stuart hayes wrote:
On 7/19/2021 10:10 AM, Lukas Wunner wrote:
Could you test if the below patch fixes the issue?
That does appear to fix the issue, thanks! Without your patch, the PCIe
devices under 64:02.0 disappear (the triggered bit is still set in the DPC
capability). With your patch, recovery is successful and all of the PCIe
devices are still there.
Thanks for testing.
The test patch clears DLLSC because the Hot Reset that is propagated
down the hierarchy causes the link to flap. I'm wondering though if
that's sufficient or if PDC needs to be cleared as well. According
to PCIe Base Spec sec. 4.2.6, LTSSM transitions from "Hot Reset" state
to "Detect", then "Polling". If I understand the table "Link Status
Mapped to the LTSSM" in the spec correctly, in-band presence is 0b
in Detect state, hence I'd expect PDC to flap as well as a result of
a Hot Reset being propagated down the hierarchy.
I think the table "Link Status Mapped to the LTSSM" is saying that when
in-band presence is 0, the LTSSM state must be "Detect" (not that being
in "Detect" will force in-band presence to zero).
I would not expect PDC to flap since the presence detect (even in-band)
should not go away during hot reset.
On the system I'm using, I modified the kernel to read and print the
Slot Status register right before your test patch clears DLLSC, and it
reads 0x140 (DLLSC is set and presence is detected, but PDC is not set).
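
For reference, a minimal sketch of how that 0x140 reading decodes. The bit
values are the PCI_EXP_SLTSTA_* definitions from
include/uapi/linux/pci_regs.h; the struct and helper names are hypothetical
and exist only for this illustration, they are not part of the test patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Status bits, same values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_SLTSTA_PDC   0x0008 /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_PDS   0x0040 /* Presence Detect State */
#define PCI_EXP_SLTSTA_DLLSC 0x0100 /* Data Link Layer State Changed */

/* Hypothetical container for the three bits discussed in this thread */
struct sltsta_flags {
	bool pdc;   /* presence changed since last clear */
	bool pds;   /* card currently present */
	bool dllsc; /* link went up or down since last clear */
};

/* Hypothetical helper: split a raw Slot Status value into flags */
static inline struct sltsta_flags decode_sltsta(uint16_t sltsta)
{
	struct sltsta_flags f = {
		.pdc   = (sltsta & PCI_EXP_SLTSTA_PDC) != 0,
		.pds   = (sltsta & PCI_EXP_SLTSTA_PDS) != 0,
		.dllsc = (sltsta & PCI_EXP_SLTSTA_DLLSC) != 0,
	};

	return f;
}
```

With decode_sltsta(0x140), dllsc and pds come out true and pdc comes out
false, since 0x140 == PCI_EXP_SLTSTA_DLLSC | PCI_EXP_SLTSTA_PDS.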
Does the hotplug port at 0000:68:00.0 support In-Band Presence Disable?
That would explain why only clearing DLLSC is sufficient.
No, the Slot Capabilities 2 register reads 0, so In-Band Presence
Disable is not supported.
The problem is, if PDC is cleared as well, we lose the ability to
detect that a device was hot-removed while the reset was ongoing,
which is unfortunate.
Agreed, but I don't think PDC should get set on hot reset.
If an error is handled by aer_root_reset() (instead of dpc_reset_link())
and the reset is performed at a hotplug port, then pciehp_reset_slot()
is invoked:
aer_root_reset()
  pci_bus_error_reset()
    pci_slot_reset()
      pci_reset_hotplug_slot()
        pciehp_reset_slot()
pciehp_reset_slot() temporarily masks both DLLSC *and* PDC events,
then performs a Secondary Bus Reset at the hotplug port.
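
The mask/reset/clear/unmask sequence can be sketched as a simplified
user-space model. This is not the pciehp code: struct toy_slot and the
toy_*() helpers are hypothetical, the Secondary Bus Reset is reduced to
"the link flaps and DLLSC gets latched", and the write-1-to-clear status
bits are modeled by clearing them directly. Only the PCI_EXP_* bit values
are taken from include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Slot Control event enables and Slot Status bits (pci_regs.h values) */
#define PCI_EXP_SLTCTL_PDCE   0x0008 /* Presence Detect Changed Enable */
#define PCI_EXP_SLTCTL_DLLSCE 0x1000 /* DLL State Changed Enable */
#define PCI_EXP_SLTSTA_PDC    0x0008 /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_DLLSC  0x0100 /* DLL State Changed */

/* Hypothetical model of one hotplug port's slot registers */
struct toy_slot {
	uint16_t sltctl; /* event enable bits */
	uint16_t sltsta; /* latched status bits */
};

/* An event is delivered only if it is both latched and enabled */
static inline int toy_pending_events(const struct toy_slot *s)
{
	int pdc = (s->sltsta & PCI_EXP_SLTSTA_PDC) &&
		  (s->sltctl & PCI_EXP_SLTCTL_PDCE);
	int dllsc = (s->sltsta & PCI_EXP_SLTSTA_DLLSC) &&
		    (s->sltctl & PCI_EXP_SLTCTL_DLLSCE);

	return pdc || dllsc;
}

/* Model of a link flap: the port latches DLLSC */
static inline void toy_link_flap(struct toy_slot *s)
{
	s->sltsta |= PCI_EXP_SLTSTA_DLLSC;
}

/* Mask both events, reset, clear what the reset latched, unmask */
static inline void toy_reset_slot(struct toy_slot *s)
{
	const uint16_t ctrl_mask = PCI_EXP_SLTCTL_PDCE | PCI_EXP_SLTCTL_DLLSCE;
	const uint16_t stat_mask = PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC;

	s->sltctl &= (uint16_t)~ctrl_mask; /* mask DLLSC and PDC events */
	toy_link_flap(s);                  /* stand-in for the SBR */
	s->sltsta &= (uint16_t)~stat_mask; /* discard reset-induced events */
	s->sltctl |= ctrl_mask;            /* re-enable both events */
}
```

In this model, a port that goes through toy_reset_slot() ends up with no
pending event, whereas a port that only sees toy_link_flap() with its
events still enabled reports one.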
If there are further hotplug ports below that hotplug port
where the SBR is performed, my expectation is that the Hot Reset
is likewise propagated down the hierarchy (just as with DPC),
so those cascaded hotplug ports should also see their link go down.
In other words, the issue you're seeing isn't really DPC-specific.
However, the test patch should fix the issue for AER-handled errors
as well. Do you agree with this analysis or did I miss anything?
That looks correct to me.
Thanks,
Lukas