Keith - I understand that the RP did not detect the error, so there is nothing to clear in its AER register. My question is: where is the fatal error status cleared in the AER registers of the device that caused the fatal error? It does not seem to be done in pci_do_recovery walking the hierarchy (unless I'm missing it)...

> -----Original Message-----
> From: Keith Busch <kbusch@xxxxxxxxxx>
> Sent: Saturday, March 13, 2021 12:12 PM
> To: James Puthukattukaran <james.puthukattukaran@xxxxxxxxxx>
> Cc: Kelley, Sean V <sean.v.kelley@xxxxxxxxx>; Kuppuswamy,
> Sathyanarayanan <sathyanarayanan.kuppuswamy@xxxxxxxxx>; Linux PCI
> <linux-pci@xxxxxxxxxxxxxxx>; bhelgaas@xxxxxxxxxx
> Subject: [External] : Re: pci_do_recovery not handling fata errors
>
> On Fri, Mar 12, 2021 at 10:57:18PM +0000, James Puthukattukaran wrote:
> > But the clearing of the fatal error in dpc_process_error is only for a DPC
> > trigger due to "unmaskable uncorrectable".
> > If the trigger reason is ERR_FATAL, then it does not hit the else clause, and
> > neither is it cleared in the pci_do_recovery code.
>
> If the reason is ERR_FATAL, then the port didn't detect the error; it is just the
> first DPC capable downstream port to receive the message from some device
> downstream, so there's nothing to clear in its AER register.
>
> > From dpc_process_error with more context --
> >
> >         else if (reason == 0 &&    <<<<<<< only for "unmaskable uncorrectable". What about for ERR_FATAL?
> >                  dpc_get_aer_uncorrect_severity(pdev, &info) &&
> >                  aer_get_device_error_info(pdev, &info)) {
> >                 aer_print_error(pdev, &info);
> >                 pci_aer_clear_nonfatal_status(pdev);
> >                 pci_aer_clear_fatal_status(pdev);
> >         }
> >
> > > -----Original Message-----
> > > From: Kelley, Sean V <sean.v.kelley@xxxxxxxxx>
> > > Sent: Friday, March 12, 2021 5:25 PM
> > > To: James Puthukattukaran <james.puthukattukaran@xxxxxxxxxx>;
> > > Kuppuswamy, Sathyanarayanan
> > > <sathyanarayanan.kuppuswamy@xxxxxxxxx>
> > > Cc: Linux PCI <linux-pci@xxxxxxxxxxxxxxx>; bhelgaas@xxxxxxxxxx
> > > Subject: [External] : Re: pci_do_recovery not handling fata errors
> > >
> > > > On Mar 12, 2021, at 12:56 PM, James Puthukattukaran
> > > > <james.puthukattukaran@xxxxxxxxxx> wrote:
> > > >
> > > > Hi -
> > > > I’m trying to understand why pci_do_recovery() only clears
> > > > non-fatal but not fatal errors. My immediate concern is the call from
> > > > dpc_handler. If a device sends an ERR_FATAL to the root port, I would
> > > > think that, as part of recovery, the fatal status in the AER registers
> > > > of the endpoint device would be cleared?
> > >
> > > Adding Sathya who mentioned to me that:
> > >
> > > Fatal errors are cleared in
> > >
> > > void dpc_process_error(struct pci_dev *pdev)
> > >
> > >                  dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > >                  aer_get_device_error_info(pdev, &info)) {
> > >                 aer_print_error(pdev, &info);
> > >                 pci_aer_clear_nonfatal_status(pdev);
> > >                 pci_aer_clear_fatal_status(pdev);
> > >
> > > Thanks,
> > >
> > > Sean
> > >
> > > > Snippet of concern in pci_do_recovery –
> > > >
> > > > /*
> > > >  * If we have native control of AER, clear error status in the Root
> > > >  * Port or Downstream Port that signaled the error. If the
> > > >  * platform retained control of AER, it is responsible for clearing
> > > >  * this status. In that case, the signaling device may not even be
> > > >  * visible to the OS.
> > > >  */
> > > > if (host->native_aer || pcie_ports_native) {
> > > >         pcie_clear_device_status(bridge);
> > > >         pci_aer_clear_nonfatal_status(bridge);    <<<< Just clearing nonfatal. What about fatal?
> > > > }
> > > >
> > > > Thanks
> > > > James
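
For reference, the distinction Keith draws above comes down to the Trigger Reason field of the port's DPC Status register. The following is only a rough userspace sketch of that decode, not the kernel code: the macro, enum, and helper names are made up for illustration, and the field layout (bits 2:1, with 0 meaning the port itself detected an unmasked uncorrectable error and 2 meaning it received ERR_FATAL) is my reading of the DPC capability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: Trigger Reason is bits 2:1 of the DPC Status register. */
#define DPC_STATUS_TRIGGER_RSN 0x0006

/* Illustrative names for the Trigger Reason encodings. */
enum dpc_reason {
	DPC_RSN_UNMASKED_UNCOR = 0, /* the port itself detected the error */
	DPC_RSN_ERR_NONFATAL   = 1, /* port received an ERR_NONFATAL message */
	DPC_RSN_ERR_FATAL      = 2, /* port received an ERR_FATAL message */
	DPC_RSN_EXTENDED       = 3, /* see the Trigger Reason Extension field */
};

/* Extract the Trigger Reason from a raw DPC Status value. */
static unsigned int dpc_trigger_reason(uint16_t status)
{
	return (status & DPC_STATUS_TRIGGER_RSN) >> 1;
}

/* Human-readable name for a decoded reason (0..3). */
static const char *dpc_reason_name(unsigned int reason)
{
	static const char *const names[] = {
		"unmasked uncorrectable error",
		"ERR_NONFATAL received",
		"ERR_FATAL received",
		"extended reason",
	};

	assert(reason < 4);
	return names[reason];
}

/*
 * Hypothetical helper: true only when the port logged the error itself.
 * That is the one case in which the quoted "else if (reason == 0 && ...)"
 * branch can go on to clear the port's own AER status; the real branch
 * additionally requires the two AER lookups to succeed.
 */
static bool dpc_port_logged_error(uint16_t status)
{
	return dpc_trigger_reason(status) == DPC_RSN_UNMASKED_UNCOR;
}
```

With a status of 0x0004 (reason 2, ERR_FATAL received) dpc_port_logged_error() returns false, which is the case the thread is about: the port has nothing of its own to clear, and the open question is who clears the status in the device that sent the message.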