On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will bring some issues, i.e., breaking softwore probing; treating

/s/softwore/software

Maybe this has already been discussed, but can you explain why treating
an ANFE as a non-fatal error will cause probing issues?

> NFE as ANFE will make us ignoring some UEs which need active recover

/s/ignoring/ignore

> operation. To avoid clearing UEs that are not ANFE by accident, the
> most conservative route is taken here: If any of the NFE Detected bits
> is set in Device Status, do not touch UE status, they should be cleared
> later by the UE handler. Otherwise, a specific set of UEs that may be
> raised as ANFE according to the PCIe specification will be cleared if
> their corresponding severity is Non-Fatal.
>
> For instance, previously when kernel receives an ANFE with Poisoned TLP
> in OS native AER mode, only status of CE will be reported and cleared:
>
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>
> If the kernel receives a Malformed TLP after that, two UEs will be
> reported, which is unexpected.
> Malformed TLP Header is lost since
> the previous ANFE gated the TLP header logs:
>
>   PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00041000/00180020
>      [12] TLP (First)
>      [18] MalfTLP
>
> Now, for the same scenario, both CE status and related UE status will be
> reported and cleared after ANFE:
>
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>     Uncorrectable errors that may cause Advisory Non-Fatal:
>      [18] TLP
>
> Tested-by: Yudong Wang <yudong.wang@xxxxxxxxx>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@xxxxxxxxxxxxxxx>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@xxxxxxxxxxxxxxx>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@xxxxxxxxx>
> ---
>  drivers/pci/pcie/aer.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ed435f09ac27..6a6a3a40569a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>           * Correctable error does not need software intervention.
>           * No need to go through error recovery process.
>           */
> -        if (aer)
> +        if (aer) {
>              pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>                      info->status);
> +            if (info->anfe_status)
> +                pci_write_config_dword(dev,
> +                               aer + PCI_ERR_UNCOR_STATUS,
> +                               info->anfe_status);
> +        }

Why split the handling part and the storing part into two patches? Why
not merge this part into patch 1/3?

>          if (pcie_aer_is_native(dev)) {
>              struct pci_driver *pdrv = dev->driver;
>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
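
For reference, the conservative rule described in the quoted commit
message can be sketched as a small user-space C function. This is only
an illustration, not the code from the series: the helper name
anfe_uncor_status() and the ANFE_UNC_MASK macro are hypothetical, the
bit values are copied from the Linux uapi header pci_regs.h, and the
set of UEs treated as possible ANFE sources is an assumption based on a
reading of the PCIe specification rather than something the commit
message enumerates.

```c
#include <stdint.h>

/* Bit values as defined in the Linux uapi header pci_regs.h. */
#define PCI_EXP_DEVSTA_NFED     0x0002      /* Non-Fatal Error Detected */
#define PCI_ERR_UNC_POISON_TLP  0x00001000  /* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME   0x00004000  /* Completion Timeout */
#define PCI_ERR_UNC_COMP_ABORT  0x00008000  /* Completer Abort */
#define PCI_ERR_UNC_UNX_COMP    0x00010000  /* Unexpected Completion */
#define PCI_ERR_UNC_MALF_TLP    0x00040000  /* Malformed TLP */
#define PCI_ERR_UNC_UNSUP       0x00100000  /* Unsupported Request */

/*
 * Assumption: the UEs that may be signalled as Advisory Non-Fatal.
 * Hypothetical macro name, used only for this sketch.
 */
#define ANFE_UNC_MASK (PCI_ERR_UNC_POISON_TLP | PCI_ERR_UNC_COMP_TIME | \
                       PCI_ERR_UNC_COMP_ABORT | PCI_ERR_UNC_UNX_COMP |  \
                       PCI_ERR_UNC_UNSUP)

/*
 * Return the UE status bits that may be cleared together with the CE
 * status when an ANFE is handled.
 *
 * devsta:       Device Status register value
 * uncor_status: Uncorrectable Error Status register value
 * uncor_sever:  Uncorrectable Error Severity register value
 *               (a set bit means the error is Fatal)
 */
static uint32_t anfe_uncor_status(uint16_t devsta, uint32_t uncor_status,
                                  uint32_t uncor_sever)
{
    /* An NFE was detected: leave every UE bit to the UE handler. */
    if (devsta & PCI_EXP_DEVSTA_NFED)
        return 0;

    /* Keep only ANFE candidates whose severity is Non-Fatal. */
    return uncor_status & ~uncor_sever & ANFE_UNC_MASK;
}
```

With no NFE detected, a Poisoned TLP status of 0x00001000 and Non-Fatal
severity, the function returns 0x00001000. A pending Malformed TLP bit
is never returned because it is not in the candidate set.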