Hi Keith,
A gentle ping. I don't know whether you got a chance to review the last email
on this subject. Please let me know your feedback/thoughts.

Thanks,
--Nilay

On 3/13/24 17:29, Nilay Shroff wrote:
> 
> 
> On 3/12/24 20:00, Keith Busch wrote:
>> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>>  	case pci_channel_io_frozen:
>>>  		dev_warn(dev->ctrl.device,
>>>  			 "frozen state error detected, reset controller\n");
>>> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> -			nvme_dev_disable(dev, true);
>>> -			return PCI_ERS_RESULT_DISCONNECT;
>>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> +				nvme_dev_disable(dev, true);
>>> +				return PCI_ERS_RESULT_DISCONNECT;
>>> +			}
>>>  		}
>>> +		flush_work(&dev->ctrl.reset_work);
>>
>> I was messing with a similar idea. I wasn't sure if EEH calls the error
>> handler inline with the error, in which case this would try to flush the
>> work within the same work, which obviously doesn't work. As long as it's
>> called from a different thread, then this should be fine.
> The EEH recovery happens from a different thread, so the flush_work() here
> should work as expected.
>
>>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>>  out_unlock:
>>>  	mutex_unlock(&dev->shutdown_lock);
>>>  out:
>>> +	/*
>>> +	 * If PCI recovery is ongoing then let it finish first
>>> +	 */
>>> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>>> +			WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
>>
>> This may break the state machine, like if the device was hot removed
>> during all this error handling. This will force the state back to
>> RESETTING when it should be DEAD.
> Agreed, we shouldn't force the ctrl state to RESETTING here.
>>
>> I think what you need is just to allow a controller to reset from a
>> connecting state. Have to be careful that wouldn't break any other
>> expectations, though.
> Yeah, OK, got your point. I have reworked the ctrl state machine as you
> suggested and tested the changes. The updated state machine code is shown
> below:
>
> @@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  		break;
>  	case NVME_CTRL_RESETTING:
>  		switch (old_state) {
>  		case NVME_CTRL_NEW:
>  		case NVME_CTRL_LIVE:
> +		case NVME_CTRL_CONNECTING:
>  			changed = true;
>  			fallthrough;
>  		default:
>  			break;
>  		}
>
> And the accordingly updated reset_work function is shown below. Here we
> ensure that even though PCI error recovery is in progress, if we can't move
> the ctrl state to RESETTING then we let reset_work make forward progress and
> set the ctrl state to DEAD.
>
> @@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
>  		return;
>
>  out_unlock:
>  	mutex_unlock(&dev->shutdown_lock);
>  out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +			dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
> +			return;
> +		}
> +	}
>  	/*
>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>  	 * may be holding this pci_dev's device lock.
>  	 */
>  	dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
>
>
> Now I have also included in my testing the hot removal of the NVMe adapter
> while EEH recovery is in progress, and the EEH recovery code handles this
> case well: when EEH recovery is in progress and we hot remove the adapter
> that is being recovered, the EEH handler stops the recovery and sets the PCI
> channel state to "pci_channel_io_perm_failure".
>
>
> Collected the logs of this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>                iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # dmesg
> [ 561.639102] EEH: Recovering PHB#18-PE#10000
> [ 561.639120] EEH: PE location: N/A, PHB location: N/A
> [ 561.639128] EEH: Frozen PHB#18-PE#10000 detected
> <snip>
> <snip>
> [ 561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
> [ 561.639441] EEH: Notify device drivers to shutdown
> [ 561.639450] EEH: Beginning: 'error_detected(IO frozen)'
> [ 561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [ 561.639462] nvme nvme0: frozen state error detected, reset controller
> [ 561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 561.719318] nvme nvme0: PCI recovery is ongoing so let it finish
>
> <!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>
>
> [ 563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> <snip>
> [ 571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
> [ 571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> <snip>
> <snip>
> [ 571.881761] EEH: Reset without hotplug activity
> [ 574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
> [ 574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
> [ 574.309091]
> [ 574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
> [ 574.579094]
> [ 574.579101] eeh_handle_normal_event: Cannot reset, err=-5
> [ 574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
> [ 574.579104] Please try reseating or replacing it
> <snip>
> <snip>
> [ 574.581314] EEH: Beginning: 'error_detected(permanent failure)'
> [ 574.581320] PCI 0018:01:00.0#10000: EEH: no device
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Another case tested: when reset_work is running after a subsystem-reset
> command and the user immediately hot removes the NVMe adapter, EEH recovery
> is *not* triggered and the ctrl makes forward progress to the DEAD state.
>
> Collected the logs of this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>                iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # nvme subsystem-reset /dev/nvme0
>
> <!!!! HOT REMOVE THE NVMe ADAPTER !!!!>
>
> # dmesg
> [ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> [ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> <snip>
> [ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 9967.224140] nvme nvme0: Disabling device after reset failure: -19
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Please let me know if the above changes look good to you. If they look good,
> I will spin a new version of the patch and send it for review.
>
> Thanks,
> --Nilay
>
>
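For reference, the hunks quoted above combine into roughly the flow sketched
below. This is an illustrative sketch only: the unquoted context lines, namely
the nvme_dev_disable()/PCI_ERS_RESULT_NEED_RESET tail of the frozen case, the
"result" variable, and the teardown comment after the early return, are
assumed from the mainline driver of that era and are not taken verbatim from
the patches in this thread.

	/* EEH recovery thread: nvme_error_detected(), frozen channel. */
	case pci_channel_io_frozen:
		dev_warn(dev->ctrl.device,
			 "frozen state error detected, reset controller\n");
		/*
		 * A reset may already be running (e.g. kicked off by
		 * nvme subsystem-reset). In that case skip the state
		 * change instead of treating the failed transition as
		 * fatal, and wait for the in-flight reset_work instead.
		 */
		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
				nvme_dev_disable(dev, true);
				return PCI_ERS_RESULT_DISCONNECT;
			}
		}
		/*
		 * Safe only because EEH invokes the error handler from its
		 * own recovery thread, never from reset_work itself.
		 */
		flush_work(&dev->ctrl.reset_work);
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;

	/* reset_work error path: hand the device back to PCI error recovery. */
	out:
		if (pci_channel_offline(to_pci_dev(dev->dev))) {
			/*
			 * CONNECTING -> RESETTING is now a permitted
			 * transition, so this succeeds while recovery owns
			 * the device. If it fails (e.g. the adapter was hot
			 * removed and the state has already moved on), fall
			 * through to the normal failure handling below.
			 */
			if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
				dev_warn(dev->ctrl.device,
					 "PCI recovery is ongoing, let it finish\n");
				return;
			}
		}
		dev_warn(dev->ctrl.device,
			 "Disabling device after reset failure: %d\n", result);
		/* ... existing failure handling disables the device and
		 * marks the controller dead ... */

The essence of the handshake is that whichever path claims the RESETTING state
owns the recovery: the EEH thread either claims it or waits for the in-flight
reset via flush_work(), while reset_work backs off with an early return when
PCI error recovery is still in charge, and only tears the controller down when
the state machine refuses the transition (e.g. after hot removal).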