On Fri, Mar 10, 2023 at 05:45:48PM -0800, Tushar Dave wrote:
> On 3/10/2023 3:53 PM, Bjorn Helgaas wrote:
> > In the log below, pciehp obviously is enabled; should I infer that in
> > the log above, it is not?
>
> pciehp is enabled all the time. In the log above and below.
> I do not have answer yet why pciehp shows-up only in some tests (due to
> DPC link down/up) and not in others like you noticed in both the logs.

Maybe some of the switch Downstream Ports are hotplug-capable and some
are not?  (Check the Slot Implemented bit in the PCI Express Capabilities
Register as well as the Hot-Plug Capable bit in the Slot Capabilities
Register; a rough sketch of such a check is appended after my sig.)

> > Generally we've avoided handling a device reset as a remove/add event
> > because upper layers can't deal well with that.  But in the log below
> > it looks like pciehp *did* treat the DPC containment as a remove/add,
> > which of course involves configuring the "new" device and its MPS
> > settings.
>
> yes and that puzzled me why? especially when "Link Down/Up ignored
> (recovered by DPC)". Do we still have race somewhere, I am not sure.

You're seeing the expected behavior.  pciehp ignores DLLSC events caused
by DPC, but then double-checks that DPC recovery succeeded.  If it did
not succeed, it would be a bug not to bring down the slot, so pciehp does
exactly that.  See this code snippet in pciehp_ignore_dpc_link_change():

	/*
	 * If the link is unexpectedly down after successful recovery,
	 * the corresponding link change may have been ignored above.
	 * Synthesize it to ensure that it is acted on.
	 */
	down_read_nested(&ctrl->reset_lock, ctrl->depth);
	if (!pciehp_check_link_active(ctrl))
		pciehp_request(ctrl, PCI_EXP_SLTSTA_DLLSC);
	up_read(&ctrl->reset_lock);

So on hotplug-capable ports, pciehp is able to mop up the mess created by
fiddling with the MPS settings behind the kernel's back.

We don't have that option on non-hotplug-capable ports.  If error
recovery fails, we generally let the inaccessible devices remain in the
system, and user interaction is then necessary to recover: either a
reboot, or manually removing and rescanning the PCI devices via sysfs
after reinstating sane MPS settings (also sketched below).

> - Switch and NVMe MPS are 512B
> - NVMe config space saved (including MPS=512B)
> - You change Switch MPS to 128B
> - NVMe does DMA with payload > 128B
> - Switch reports Malformed TLP because TLP is larger than its MPS
> - Recovery resets NVMe, which sets MPS to the default of 128B
> - nvme_slot_reset() restores NVMe config space (MPS is now 512B)
> - Subsequent NVMe DMA with payload > 128B repeats cycle

Forgive my ignorance, but if MPS is restored to 512B by
nvme_slot_reset(), shouldn't the communication with the device just work
again from that point on?

Thanks,

Lukas
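
P.S.: To illustrate the capability check mentioned above, here's a
minimal kernel-style sketch.  It is not authoritative: the helper name
is made up for illustration, and IIRC the kernel's own check lives in
set_pcie_hotplug_bridge() in drivers/pci/probe.c:

	#include <linux/pci.h>

	/* Illustrative helper, not an existing kernel function. */
	static bool port_is_hotplug_capable(struct pci_dev *pdev)
	{
		u16 flags;
		u32 sltcap;

		/* Slot Implemented bit in the PCI Express Capabilities Register */
		pcie_capability_read_word(pdev, PCI_EXP_FLAGS, &flags);
		if (!(flags & PCI_EXP_FLAGS_SLOT))
			return false;

		/* Hot-Plug Capable bit in the Slot Capabilities Register */
		pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &sltcap);
		return !!(sltcap & PCI_EXP_SLTCAP_HPC);
	}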
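
And a rough userspace sketch of the manual sysfs recovery referred to
further up.  The BDF is just a placeholder and the MPS fixup (e.g. with
setpci) is elided; this is meant to show the mechanism, not a recipe for
your exact topology:

	#include <stdio.h>
	#include <stdlib.h>

	/* Write a value to a sysfs attribute, bail out on failure. */
	static void write_sysfs(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			exit(EXIT_FAILURE);
		}
		fputs(val, f);
		fclose(f);
	}

	int main(void)
	{
		/* Detach the inaccessible endpoint (placeholder BDF). */
		write_sysfs("/sys/bus/pci/devices/0000:01:00.0/remove", "1");

		/* ... reinstate sane MPS settings on the switch ports here ... */

		/* Re-enumerate; the endpoint comes back with fresh settings. */
		write_sysfs("/sys/bus/pci/rescan", "1");

		return 0;
	}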