On Thu, Mar 4, 2021 at 3:19 PM Kuppuswamy, Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxx> wrote: > > > On 3/4/21 2:59 PM, Dan Williams wrote: > > On Thu, Mar 4, 2021 at 2:38 PM Kuppuswamy, Sathyanarayanan > > <sathyanarayanan.kuppuswamy@xxxxxxxxx> wrote: > >> > >> On 3/4/21 2:11 PM, Dan Williams wrote: > >> > >> On Thu, Mar 4, 2021 at 12:03 PM Keith Busch <kbusch@xxxxxxxxxx> wrote: > >> > >> On Tue, Mar 02, 2021 at 09:46:40PM -0800, Kuppuswamy, Sathyanarayanan wrote: > >> > >> On 3/2/21 9:34 PM, Williams, Dan J wrote: > >> > >> [ Add Sathya ] > >> > >> On Mon, 2021-01-04 at 15:02 -0800, Keith Busch wrote: > >> > >> Overwriting the frozen detected status with the result of the link reset > >> loses the NEED_RESET result that drivers are depending on for error > >> handling to report the .slot_reset() callback. Retain this status so > >> that subsequent error handling has the correct flow. > >> > >> Reported-by: Hinko Kocevar <hinko.kocevar@xxxxxx> > >> Acked-by: Sean V Kelley <sean.v.kelley@xxxxxxxxx> > >> Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx> > >> > >> Just want to report that this fix might be a candidate for -stable. > >> > >> Agree. > >> > >> I think it can be merged in both stable and mainline kernels. > >> > >> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> > >> > >> Just FYI, this patch is practically a revert of this one: > >> > >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=6d2c89441571ea534d6240f7724f518936c44f8d > >> > >> so please let me know if that is still a problem for you. > >> > >> For what it's worth I think "6d2c89441571 PCI/ERR: Update error status > >> after reset_link()" is not justified. The link shouldn't recover if > >> the attached device is not prepared to handle DPC events. > >> > >> I added that fix to address the recovery issue seen in a Dell server > >> platform (for EDR test case). If I understand the history correctly, > >> In EDR case, AER and DPC is owned by firmware, hence we get > >> PCI_ERS_RESULT_NO_AER_DRIVER when executing error_detected() callbacks. > >> So If we continue the pcie_do_recovery() with PCI_ERS_RESULT_NO_AER_DRIVER > >> as error status, then even if we successfully reset the link we will report > >> the recovery status as failure. > > But that's the right response if there is no handler. > If the handler is not available due to AER being owned by firmware, > then it needs to be fixed. In EDR mode, even if DPC/AER is owned > by firmware , OS need to own the recovery part. So I think it > needs further investigation to understand why it reports, > PCI_ERS_RESULT_NO_AER_DRIVER As far as I can see the only way to get PCI_ERS_RESULT_NO_AER_DRIVER is when there actually is no handler, or the device io state has set to failed. I notice the hotplug handler sets the device io state to failed while processing link down. If the device is actually missing a handler definition then disconnect seems the right response.