But the clearing of fatal errors in dpc_process_error() applies only when DPC was triggered by an "unmasked uncorrectable error". If the trigger reason is ERR_FATAL, the else-if clause is not taken, and the fatal status is not cleared in the pci_do_recovery() code either.

From dpc_process_error(), with more context --

	else if (reason == 0 &&	<<<<<<< only for "unmasked uncorrectable error". What about ERR_FATAL?
		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
		 aer_get_device_error_info(pdev, &info)) {
		aer_print_error(pdev, &info);
		pci_aer_clear_nonfatal_status(pdev);
		pci_aer_clear_fatal_status(pdev);
	}

> -----Original Message-----
> From: Kelley, Sean V <sean.v.kelley@xxxxxxxxx>
> Sent: Friday, March 12, 2021 5:25 PM
> To: James Puthukattukaran <james.puthukattukaran@xxxxxxxxxx>;
> Kuppuswamy, Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxx>
> Cc: Linux PCI <linux-pci@xxxxxxxxxxxxxxx>; bhelgaas@xxxxxxxxxx
> Subject: [External] : Re: pci_do_recovery not handling fata errors
>
> > On Mar 12, 2021, at 12:56 PM, James Puthukattukaran <james.puthukattukaran@xxxxxxxxxx> wrote:
> >
> > Hi -
> > I’m trying to understand why pci_do_recovery() clears only non-fatal
> > and not fatal errors. My immediate concern is the call from dpc_handler.
> > If a device sends an ERR_FATAL to the root port, I would think that, as
> > part of recovery, the fatal status in the AER registers of the endpoint
> > device would be cleared?
> >
>
> Adding Sathya, who mentioned to me that:
>
> Fatal errors are cleared in
>
> void dpc_process_error(struct pci_dev *pdev)
>
> 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
> 		 aer_get_device_error_info(pdev, &info)) {
> 		aer_print_error(pdev, &info);
> 		pci_aer_clear_nonfatal_status(pdev);
> 		pci_aer_clear_fatal_status(pdev);
>
> Thanks,
>
> Sean
>
> > Snippet of concern in pci_do_recovery –
> >
> > 	/*
> > 	 * If we have native control of AER, clear error status in the Root
> > 	 * Port or Downstream Port that signaled the error.  If the
> > 	 * platform retained control of AER, it is responsible for clearing
> > 	 * this status.  In that case, the signaling device may not even be
> > 	 * visible to the OS.
> > 	 */
> > 	if (host->native_aer || pcie_ports_native) {
> > 		pcie_clear_device_status(bridge);
> > 		pci_aer_clear_nonfatal_status(bridge);	<<<< Just clearing nonfatal. What about fatal?
> > 	}
> >
> > Thanks
> > James
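
For illustration, here is a minimal standalone sketch of how the trigger reason discussed above gates the clearing branch. It assumes the Trigger Reason field occupies bits 2:1 of the DPC Status register (mask 0x0006, as in the kernel's pci_regs.h). The macro, enum, and helper names are hypothetical stand-ins, not kernel symbols. Only the `reason == 0` test from the quoted else-if is modelled here. The real branch additionally requires dpc_get_aer_uncorrect_severity() and aer_get_device_error_info() to succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local stand-in for the Trigger Reason mask (bits 2:1). */
#define DPC_STATUS_TRIGGER_RSN 0x0006

/* Trigger Reason encodings; the names are illustrative only. */
enum dpc_reason {
	DPC_REASON_UNMASKED_UNCOR = 0,	/* unmasked uncorrectable error */
	DPC_REASON_ERR_NONFATAL   = 1,	/* ERR_NONFATAL message received */
	DPC_REASON_ERR_FATAL      = 2,	/* ERR_FATAL message received */
	DPC_REASON_EXTENDED       = 3,	/* see the extended reason field */
};

/* Extract the Trigger Reason field from a raw DPC Status value. */
static enum dpc_reason dpc_trigger_reason(uint16_t status)
{
	return (enum dpc_reason)((status & DPC_STATUS_TRIGGER_RSN) >> 1);
}

/*
 * Models only the first condition of the quoted else-if branch:
 * the AER status clearing can be reached only when reason == 0.
 */
static bool dpc_branch_may_clear_aer(uint16_t status)
{
	return dpc_trigger_reason(status) == DPC_REASON_UNMASKED_UNCOR;
}
```

With a status value of 0x0005 (trigger bit set, reason 2, i.e. ERR_FATAL), dpc_branch_may_clear_aer() returns false, which is the case being asked about in this thread.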