On Fri, Jan 12, 2024 at 10:41:07AM -0600, Bjorn Helgaas wrote: > On Thu, Jan 11, 2024 at 03:32:15PM +0800, Wang, Qingshun wrote: > > According to PCIe specification 4.0 sections 6.2.3.2.4 and 6.2.4.3, > > certain uncorrectable errors will signal ERR_COR instead of > > ERR_NONFATAL, logged as Advisory Non-Fatal Error, and set bits in > > both Correctable Error Status register and Uncorrectable Error Status > > register. Currently, when handling AER event the kernel will only look > > at CE status or UE status, but never both. In the Advisory > > Non-Fatal Error case, bits set in UE status register will not be > > reported and cleared until the next Fatal/Non-Fatal error arrives. > > > > For instance, before this patch series, once kernel receives an ANFE > > with Poisoned TLP in OS native AER mode, only status of CE will be > > reported and cleared: > > > > [ 148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0 > > [ 148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID) > > [ 148.459863] pcieport 0000:b7:02.0: device [8086:0db0] error status/mask=00002000/00000000 > > [ 148.459868] pcieport 0000:b7:02.0: [13] NonFatalErr > > > > If the kernel receives a Malformed TLP after that, two UE will be > > reported, which is unexpected. Malformed TLP Header was lost since > > the previous ANF gated the TLP header logs: > > > > [ 170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID) > > [ 170.552420] pcieport 0000:b7:02.0: device [8086:0db0] error status/mask=00041000/00180020 > > [ 170.561904] pcieport 0000:b7:02.0: [12] TLP (First) > > [ 170.569656] pcieport 0000:b7:02.0: [18] MalfTLP > > > > To handle this case properly, this patch set adds additional fields > > in aer_err_info to track both CE and UE status/mask and UE severity. > > This information will later be used to determine the register bits > > that need to be cleared. Additionally, adds more data to aer_event > > tracepoint, which would help to better understand ANFE and other errors > > for external observation. > > > > In the previous scenario, after this patch series, both CE status and > > related UE status will be reported and cleared after ANFE: > > > > [ 148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0 > > [ 148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID) > > [ 148.459863] pcieport 0000:b7:02.0: device [8086:0db0] error status/mask=00002000/00000000 > > [ 148.459868] pcieport 0000:b7:02.0: [13] NonFatalErr > > [ 148.459868] pcieport 0000:b7:02.0: Uncorrectable errors that may cause Advisory Non-Fatal: > > [ 148.459868] pcieport 0000:b7:02.0: [18] TLP > > Thanks for the overview here. It would be good to put some of these > details in the commit logs of the patches that implement this, because > this cover letter is not preserved when the series is merged. Thanks for your advice, will put some of these details in commit logs, mainly in PATCH 2. > > If/when you do, remove the timestamps because they're not relevant and > are merely distracting. Indent quoted material a couple spaces. Agreed. Thanks. > > Update the citations to a current spec revision (PCIe r6.0, or maybe > PCIe r6.1). The section numbers are probably the same, but there's no > point in citing a revision that's 6.5 years old when newer ones are > available. Makes sense, thanks! > > Bjorn