On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
> 	mutex_unlock(&dev->shutdown_lock);
>   out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +		return;
> +	}
> +
> 	/*
> 	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
> 	 * may be holding this pci_dev's device lock.
> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
> 	case pci_channel_io_frozen:
> 		dev_warn(dev->ctrl.device,
> 			"frozen state error detected, reset controller\n");
> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -			nvme_dev_disable(dev, true);
> -			return PCI_ERS_RESULT_DISCONNECT;
> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +				nvme_dev_disable(dev, true);
> +				return PCI_ERS_RESULT_DISCONNECT;
> +			}
> 		}
> 		nvme_dev_disable(dev, false);
> 		return PCI_ERS_RESULT_NEED_RESET;

I get what you're trying to do, but it looks racy. The reset_work may
finish before PCI sets the channel offline, or the error handling work
may happen to see the RESETTING state but then the controller
transitions to CONNECTING afterward and deadlocks on the '.resume()'
side (a toy illustration of that check-then-act pattern is at the end
of this mail). You are counting on a very specific sequence tied to the
PCIe error handling module, and maybe you can count on that sequence
for your platform in this unique scenario, but these link errors could
happen at any time.

And nvme subsystem reset is just odd; it's not clear how it was
intended to be handled. It takes the links down, so it seems to require
re-enumeration from a PCIe hotplug driver, and that's kind of how it
was expected to work here, but your platform has a special way to
contain the link event and bring things back up the way they were
before.

And the fact that you *require* IO to be in flight just so the timeout
handler can dispatch a non-posted transaction 30 seconds later to
trigger EEH is also odd. Why can't EEH just detect the link-down event
directly?

This driver unfortunately doesn't handle errors during a reset well.
Trying to handle that has been problematic, so the driver just bails if
anything goes wrong at this critical initialization point. Maybe we
need to address the reset/initialization failure handling more
generically and delegate the teardown-or-retry decision to something
else (also sketched at the end of this mail). Messing with that is
pretty fragile right now, though. Or you could just re-enumerate the
slot.

I don't know; sorry, my message is not really helping much to get this
fixed.
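To spell out the class of problem I mean by "racy", here's a tiny
user-space model of the check-then-act pattern. This is toy code, not
driver code; every name in it is made up for illustration only. One
thread plays the error handler sampling the state, the other plays the
reset work moving the state along, and whatever the handler decided can
already be stale by the time it acts on it:

/* toy-race.c: build with `cc -pthread toy-race.c` */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

enum toy_state { TOY_RESETTING, TOY_CONNECTING };

static _Atomic int state = TOY_RESETTING;

/* plays the role of the error handler: decides based on a snapshot */
static void *toy_error_handler(void *unused)
{
	int seen = atomic_load(&state);

	if (seen == TOY_RESETTING)
		sched_yield();	/* window where the reset work keeps going */

	/* the decision above used 'seen', which may be stale by now */
	printf("handler sampled %d, state is now %d\n",
	       seen, atomic_load(&state));
	return NULL;
}

/* plays the role of reset_work: moves the state along regardless */
static void *toy_reset_work(void *unused)
{
	atomic_store(&state, TOY_CONNECTING);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, toy_error_handler, NULL);
	pthread_create(&b, NULL, toy_reset_work, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}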
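And to be a little more concrete about delegating the teardown-or-retry
decision, here's an equally made-up sketch of a single policy point
that every failed reset/init path reports to, which then chooses
between deferring to platform recovery, retrying, or tearing down.
Again, this only illustrates the shape of the idea, not a proposal for
the actual driver structure or naming:

/* toy-policy.c: build with `cc toy-policy.c` */
#include <stdbool.h>
#include <stdio.h>

enum toy_fail_reason { TOY_FAIL_INIT, TOY_FAIL_LINK_DOWN, TOY_FAIL_TIMEOUT };

struct toy_ctrl {
	int reset_attempts;
	bool recovery_in_progress;	/* e.g. platform recovery still running */
};

/*
 * Single decision point: failure sites report what happened and this
 * chooses what to do next, instead of each site bailing on its own.
 */
static void toy_reset_failed(struct toy_ctrl *ctrl, enum toy_fail_reason why)
{
	if (ctrl->recovery_in_progress) {
		printf("defer: let platform recovery finish, then retry\n");
		return;
	}
	if (why == TOY_FAIL_INIT && ctrl->reset_attempts < 2) {
		ctrl->reset_attempts++;
		printf("retry reset, attempt %d\n", ctrl->reset_attempts);
		return;
	}
	printf("tear the controller down\n");
}

int main(void)
{
	struct toy_ctrl ctrl = { .reset_attempts = 0,
				 .recovery_in_progress = true };

	toy_reset_failed(&ctrl, TOY_FAIL_INIT);		/* deferred */
	ctrl.recovery_in_progress = false;
	toy_reset_failed(&ctrl, TOY_FAIL_INIT);		/* retried */
	toy_reset_failed(&ctrl, TOY_FAIL_TIMEOUT);	/* torn down */
	return 0;
}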