>-----Original Message----- >From: Kuppuswamy Sathyanarayanan ><sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> >Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might >be ANFE > > >On 6/13/24 7:40 PM, Duan, Zhenzhong wrote: >> >>> -----Original Message----- >>> From: Kuppuswamy, Sathyanarayanan >>> Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that >might >>> be ANFE >>> >>> >>> On 5/9/24 1:48 AM, Zhenzhong Duan wrote: >>>> When processing an ANFE, ideally both correctable error(CE) status and >>>> uncorrectable error(UE) status should be cleared. However, there is no >>>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal >>>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as >>>> NFE will bring some issues, i.e., breaking softwore probing; treating >>> /s/softwore/software >> Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this. >> >>> May be this is already discussed. But can you explain why treating >>> AFNE as non-fatal error will bring probing issues? >> Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error. >> >> In some cases the detector of a non-fatal error is not the most appropriate >agent to determine whether the error is >> recoverable or not, or if it even needs any recovery action at all. For >example, if software attempts to perform a >> configuration read from a non-existent device or Function, the resulting UR >Status in the Completion will signal the error >> to software, and software does not need for the Completer in addition to >signal the error by sending an ERR_NONFATAL >> Message. In fact, on some platforms, signaling the error with >ERR_NONFATAL results in a System Error, which breaks >> normal software probing. >> >>>> NFE as ANFE will make us ignoring some UEs which need active recover >>> /s/ignoring/ignore >> Will fix. >> >>>> operation. To avoid clearing UEs that are not ANFE by accident, the >>>> most conservative route is taken here: If any of the NFE Detected bits >>>> is set in Device Status, do not touch UE status, they should be cleared >>>> later by the UE handler. Otherwise, a specific set of UEs that may be >>>> raised as ANFE according to the PCIe specification will be cleared if >>>> their corresponding severity is Non-Fatal. >>>> >>>> For instance, previously when kernel receives an ANFE with Poisoned >TLP >>>> in OS native AER mode, only status of CE will be reported and cleared: >>>> >>>> AER: Correctable error message received from 0000:b7:02.0 >>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver >ID) >>>> device [8086:0db0] error status/mask=00002000/00000000 >>>> [13] NonFatalErr >>>> >>>> If the kernel receives a Malformed TLP after that, two UEs will be >>>> reported, which is unexpected. Malformed TLP Header is lost since >>>> the previous ANFE gated the TLP header logs: >>>> >>>> PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, >>> (Receiver ID) >>>> device [8086:0db0] error status/mask=00041000/00180020 >>>> [12] TLP (First) >>>> [18] MalfTLP >>>> >>>> Now, for the same scenario, both CE status and related UE status will be >>>> reported and cleared after ANFE: >>>> >>>> AER: Correctable error message received from 0000:b7:02.0 >>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver >ID) >>>> device [8086:0db0] error status/mask=00002000/00000000 >>>> [13] NonFatalErr >>>> Uncorrectable errors that may cause Advisory Non-Fatal: >>>> [18] TLP >>>> >>>> Tested-by: Yudong Wang <yudong.wang@xxxxxxxxx> >>>> Co-developed-by: "Wang, Qingshun" <qingshun.wang@xxxxxxxxxxxxxxx> >>>> Signed-off-by: "Wang, Qingshun" <qingshun.wang@xxxxxxxxxxxxxxx> >>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@xxxxxxxxx> >>>> --- >>>> drivers/pci/pcie/aer.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >>>> index ed435f09ac27..6a6a3a40569a 100644 >>>> --- a/drivers/pci/pcie/aer.c >>>> +++ b/drivers/pci/pcie/aer.c >>>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct >>> pci_dev *dev, struct aer_err_info *info) >>>> * Correctable error does not need software intervention. >>>> * No need to go through error recovery process. >>>> */ >>>> - if (aer) >>>> + if (aer) { >>>> pci_write_config_dword(dev, aer + >>> PCI_ERR_COR_STATUS, >>>> info->status); >>>> + if (info->anfe_status) >>>> + pci_write_config_dword(dev, >>>> + aer + >>> PCI_ERR_UNCOR_STATUS, >>>> + info->anfe_status); >>>> + } >>> Why split the handling part and storing part into two patches? Why not >>> merge >>> this part of patch 1/3. >> This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux- >pci/msg149012.html, >> clearing UNCOR_STATUS might be more important, deserve to raise out. > >I think Bjorn's suggestion is to divide it into two logical patches. >One for printing the error and another to clear the UNCOR_STATUS >properly. But currently you have split the UNCOR_STATUS status caching and >clearing process into two patches. IMO, your first patch can store ANFE >status and clear it. You can add print support in the second patch. OK, I'll merge patch1 with patch3 in next version. > > >Code wise it looks fine to me. You can add my Reviewed-by after fixing >the typos > >Reviewed-by: Kuppuswamy Sathyanarayanan ><sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx> Thanks zhenzhong > >> >> Thanks >> Zhenzhong > >-- >Sathyanarayanan Kuppuswamy >Linux Kernel Developer