Hi Lukas,

Thanks for the review.

On 6/15/23 11:35 AM, Lukas Wunner wrote:
> On Wed, Jun 14, 2023 at 11:25:59PM -0700, Kuppuswamy Sathyanarayanan wrote:
>> During the EDR-based DPC recovery process, for devices with persistent
>> issues, the firmware may choose not to handle the DPC error and leave
>> the port in DPC triggered state. In such scenarios, if the user
>> replaces the faulty device with a new one, the OS is expected to clear
>> the DPC trigger status in the hotplug error handler to enable the new
>> device enumeration.
>
> You're clearing the DPC trigger status upon a PDC event, yet are saying
> here the purpose is to reset port state for a future hotplugged device.

Sorry, it is a typo. I meant "hotplug interrupt handler". The goal is to
ensure that when a new device presence is detected, the old DPC trigger
status is cleared.

>
> A PDC event may be synthesized, e.g. to trigger slot bringup via
> sysfs, so using a PDC event to clear DPC trigger status feels wrong.

IMO, it is harmless. We just want to make sure the previous DPC trigger
status is cleared before enumerating a new device.

> pciehp_unconfigure_device() seems like a more appropriate place to me.
>

I initially thought to add it there. Spec also recommends clearing it
when removing the device. But I wasn't sure if pciehp_unconfigure_device()
would be called only during device removal. Let me test this path and get
back to you.

>
>> More details about this issue can be found in PCIe
>> firmware specification, r3.3, sec titled "DPC Event Handling"
>> Implementation note.
>
> That Implementation Note contains a lot of text and a fairly complex
> flow chart. If you could point to specific paragraphs or numbers in
> the Implementation Note that would make life easier for a reviewer
> to make the connection between your code and the spec.

It is the text at the end of the flowchart. Copied it here for reference.

For devices with persistent errors, a port may be kept in the DPC
triggered state (disabled) to keep those devices from continuing to
generate errors. For hot-plug slots, the errant device may be removed
and replaced with a new device. If the DPC trigger state is not cleared,
then the port above the newly inserted device will still be disabled and
will be non-operational. Therefore, operating systems may need to modify
their hot-plug interrupt handling code to clear DPC Trigger Status when a
device is removed so that a subsequent insertion will succeed.

>
>
>> Similar issue might also happen if the DPC or EDR recovery handler
>> exits before clearing the trigger status. To fix this issue, clear the
>> DPC trigger status in PDC interrupt handler.
>
> I was about to ask why the code is added to dpc.c, not edr.c,
> and why it's not constrained to CONFIG_PCIE_EDR, but I assume
> that's the reason? Because it "might" happen for OS-native DPC
> as well?

Yes. There are code paths in the DPC driver where the error recovery
handler can exit before clearing the DPC trigger status. So I think this
fix is applicable to the native code as well.

>
>
>> +/**
>> + * pci_reset_trigger - Clear DPC trigger status
>> + * @pdev: PCI device
>> + *
>> + * It is called from the PCIe hotplug driver to clean the DPC
>> + * trigger status in the PDC interrupt handler.
>> + */
>> +void pci_dpc_reset_trigger(struct pci_dev *pdev)
>> +{
>> +	if (!pdev->dpc_cap)
>> +		return;
>> +
>> +	pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS,
>> +			      PCI_EXP_DPC_STATUS_TRIGGER);
>> +}
>
> This may run concurrently to dpc_reset_link(), so I'd expect that
> you need some kind of serialization. What happens if pciehp clears
> trigger status behind the DPC driver's back while it is handling an
> error?

Currently, we only call pci_dpc_reset_trigger() in the PDC interrupt
handler. Do you think there would be a race between the error handler
and the PDC handler?
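
For reference, the write in pci_dpc_reset_trigger() works because DPC
Trigger Status is a write-1-to-clear bit: writing 1 clears it, writing 0
leaves it alone. Below is a rough userspace model of that behaviour only.
It is not kernel code and says nothing about the serialization question.
The fake_dpc_* names and FAKE_DPC_STATUS_TRIGGER are made up for this
sketch, and it assumes every bit other than the trigger bit is read-only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the trigger bit in the DPC Status register. */
#define FAKE_DPC_STATUS_TRIGGER 0x0001u

/* Hypothetical model of a port: just the 16-bit DPC Status value. */
struct fake_dpc_port {
	uint16_t status;
};

/*
 * Emulate a config write to the status register. A 1 written to the
 * trigger bit clears it; 0 bits and all other positions are ignored
 * (assumption of this model: everything else is read-only).
 */
static void fake_dpc_status_write(struct fake_dpc_port *port, uint16_t val)
{
	assert(port);
	port->status &= (uint16_t)~(val & FAKE_DPC_STATUS_TRIGGER);
}

/* True while the modelled port is still in the DPC triggered state. */
static bool fake_dpc_triggered(const struct fake_dpc_port *port)
{
	return (port->status & FAKE_DPC_STATUS_TRIGGER) != 0;
}

/* Same shape as pci_dpc_reset_trigger(): write the trigger bit back. */
static void fake_dpc_reset_trigger(struct fake_dpc_port *port)
{
	fake_dpc_status_write(port, FAKE_DPC_STATUS_TRIGGER);
}
```

In this model, writing 0 to the status word leaves the port triggered, and
only fake_dpc_reset_trigger() brings it out of the triggered state; the
other status bits are untouched either way.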

>
> Thanks,
>
> Lukas

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer