On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > In the log below, pciehp obviously is enabled; should I infer that in
> > the log above, it is not?
>
> pciehp is enabled all the time. In the log above and below.
> I do not have answer yet why pciehp shows-up only in some tests (due to
> DPC link down/up) and not in others like you noticed in both the logs.

Maybe some of the switch Downstream Ports are hotplug-capable and some
are not?  (Check the Slot Implemented bit in the PCI Express Capabilities
Register as well as the Hot-Plug Capable bit in the Slot Capabilities
Register; a rough sketch of such a check is appended after my sig.)

> > Generally we've avoided handling a device reset as a remove/add event
> > because upper layers can't deal well with that.  But in the log below
> > it looks like pciehp *did* treat the DPC containment as a remove/add,
> > which of course involves configuring the "new" device and its MPS
> > settings.
>
> yes and that puzzled me why? especially when "Link Down/Up ignored
> (recovered by DPC)". Do we still have race somewhere, I am not sure.

You're seeing the expected behavior.  pciehp ignores DLLSC events caused
by DPC, but then double-checks that DPC recovery succeeded.  If it did
not succeed, it would be a bug not to bring down the slot, so pciehp does
exactly that.  See this code snippet in pciehp_ignore_dpc_link_change():

	/*
	 * If the link is unexpectedly down after successful recovery,
	 * the corresponding link change may have been ignored above.
	 * Synthesize it to ensure that it is acted on.
	 */
	down_read_nested(&ctrl->reset_lock, ctrl->depth);
	if (!pciehp_check_link_active(ctrl))
		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
	up_read(&ctrl->reset_lock);

So on hotplug-capable ports, pciehp is able to mop up the mess created by
fiddling with the MPS settings behind the kernel's back.

We don't have that option on non-hotplug-capable ports.  If error
recovery fails, we generally let the inaccessible devices remain in the
system, and user interaction is then necessary to recover: either a
reboot, or manually removing and rescanning the PCI devices via sysfs
after reinstating sane MPS settings (also sketched below).

> - Switch and NVMe MPS are 512B
> - NVMe config space saved (including MPS=512B)
> - You change Switch MPS to 128B
> - NVMe does DMA with payload > 128B
> - Switch reports Malformed TLP because TLP is larger than its MPS
> - Recovery resets NVMe, which sets MPS to the default of 128B
> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> - Subsequent NVMe DMA with payload > 128B repeats cycle

Forgive my ignorance, but if MPS is restored to 512B by
nvme_slot_reset(), shouldn't the communication with the device just work
again from that point on?

Thanks,

Lukas
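
P.S.: To illustrate the capability check mentioned above, here's a
minimal kernel-style sketch.  It is not authoritative: the helper name
is made up for illustration, and IIRC the kernel's own check lives in
set_pcie_hotplug_bridge() in drivers/pci/probe.c:

	#include <linux/pci.h>

	/* Illustrative helper, not an existing kernel function. */
	static bool port_is_hotplug_capable(struct pci_dev *pdev)
	{
		u16 flags;
		u32 sltcap;

		/* Slot Implemented bit in the PCI Express Capabilities Register */
		pcie_capability_read_word(pdev, PCI_EXP_FLAGS, &flags);
		if (!(flags & PCI_EXP_FLAGS_SLOT))
			return false;

		/* Hot-Plug Capable bit in the Slot Capabilities Register */
		pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &sltcap);
		return !!(sltcap & PCI_EXP_SLTCAP_HPC);
	}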
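
And a rough userspace sketch of the manual sysfs recovery referred to
further up.  The BDF is just a placeholder and the MPS fixup (e.g. with
setpci) is elided; this is meant to show the mechanism, not a recipe for
your exact topology:

	#include <stdio.h>
	#include <stdlib.h>

	/* Write a value to a sysfs attribute, bail out on failure. */
	static void write_sysfs(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(EXIT_FAILURE);
		}
		fputs(val, f);
		fclose(f);
	}

	int main(void)
	{
		/* Detach the inaccessible endpoint (placeholder BDF). */
		write_sysfs("/sys/bus/pci/devices/0000:01:00.0/remove", "1");

		/* ... reinstate sane MPS settings on the switch ports here ... */

		/* Re-enumerate; the endpoint comes back with fresh settings. */
		write_sysfs("/sys/bus/pci/rescan", "1");

		return 0;
	}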