Re: [PATCH 0/4] pci/aer: Handle Advisory Non-Fatal properly

On Fri, Jan 12, 2024 at 10:41:07AM -0600, Bjorn Helgaas wrote:
> On Thu, Jan 11, 2024 at 03:32:15PM +0800, Wang, Qingshun wrote:
> > According to PCIe specification 4.0 sections 6.2.3.2.4 and 6.2.4.3,
> > certain uncorrectable errors will signal ERR_COR instead of
> > ERR_NONFATAL, logged as Advisory Non-Fatal Error, and set bits in
> > both Correctable Error Status register and Uncorrectable Error Status
> > register. Currently, when handling an AER event, the kernel will only look
> > at CE status or UE status, but never both. In the Advisory
> > Non-Fatal Error case, bits set in UE status register will not be
> > reported and cleared until the next Fatal/Non-Fatal error arrives.
> > 
> > For instance, before this patch series, once the kernel receives an ANFE
> > with Poisoned TLP in OS native AER mode, only status of CE will be
> > reported and cleared:
> > 
> > [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> > [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> > [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> > 
> > If the kernel receives a Malformed TLP after that, two UEs will be
> > reported, which is unexpected. The Malformed TLP header is lost because
> > the previous ANFE gated the TLP header logs:
> > 
> > [  170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> > [  170.552420] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00041000/00180020
> > [  170.561904] pcieport 0000:b7:02.0:    [12] TLP                    (First)
> > [  170.569656] pcieport 0000:b7:02.0:    [18] MalfTLP       
> > 
> > To handle this case properly, this patch set adds additional fields
> > in aer_err_info to track both CE and UE status/mask and UE severity.
> > This information will later be used to determine the register bits
> > that need to be cleared. It also adds more data to the aer_event
> > tracepoint, which helps external observers better understand ANFE and
> > other errors.
> > 
> > In the previous scenario, after this patch series, both CE status and
> > related UE status will be reported and cleared after ANFE:
> > 
> > [  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
> > [  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
> > [  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
> > [  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
> > [  148.459868] pcieport 0000:b7:02.0:   Uncorrectable errors that may cause Advisory Non-Fatal:
> > [  148.459868] pcieport 0000:b7:02.0:    [12] TLP
> 
> Thanks for the overview here.  It would be good to put some of these
> details in the commit logs of the patches that implement this, because
> this cover letter is not preserved when the series is merged.
Thanks for the advice. I will put some of these details in the commit logs,
mainly in PATCH 2.
> 
> If/when you do, remove the timestamps because they're not relevant and
> are merely distracting.  Indent quoted material a couple spaces.
Agreed. Thanks.
> 
> Update the citations to a current spec revision (PCIe r6.0, or maybe
> PCIe r6.1).  The section numbers are probably the same, but there's no
> point in citing a revision that's 6.5 years old when newer ones are
> available.
Makes sense, thanks!
> 
> Bjorn
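
The selection rule described in the cover letter can be sketched in plain
C as below. This is only an illustration under stated assumptions, not the
code from the patch series: struct aer_snapshot and anfe_candidates() are
hypothetical names, and the bit definitions are redefined locally so the
sketch builds in user space (the kernel names for them are
PCI_ERR_COR_ADV_NFAT, PCI_ERR_UNC_POISON_TLP and PCI_ERR_UNC_MALF_TLP).
The idea is that when the Advisory Non-Fatal bit is set in CE status, the
UE status bits that are unmasked and configured as non-fatal are the ones
that may have caused it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Correctable Error Status, bit 13: Advisory Non-Fatal Error. */
#define COR_ADV_NFAT    (1u << 13)
/* Uncorrectable Error Status, bit 12: Poisoned TLP Received. */
#define UNC_POISON_TLP  (1u << 12)
/* Uncorrectable Error Status, bit 18: Malformed TLP. */
#define UNC_MALF_TLP    (1u << 18)

/* Hypothetical snapshot of the AER registers; not the kernel's aer_err_info. */
struct aer_snapshot {
	uint32_t cor_status;     /* Correctable Error Status */
	uint32_t uncor_status;   /* Uncorrectable Error Status */
	uint32_t uncor_mask;     /* Uncorrectable Error Mask */
	uint32_t uncor_severity; /* Uncorrectable Error Severity (1 = fatal) */
};

/*
 * Return the UE status bits that may have been signaled as an Advisory
 * Non-Fatal Error: set in UE status, not masked, and not marked fatal.
 * Returns 0 when CE status does not report an ANFE at all.
 */
static uint32_t anfe_candidates(const struct aer_snapshot *s)
{
	assert(s != NULL);

	if (!(s->cor_status & COR_ADV_NFAT))
		return 0;

	return s->uncor_status & ~s->uncor_mask & ~s->uncor_severity;
}
```

With CE status 0x00002000, UE status 0x00001000 and UE mask 0x00180020
from the logs above, and an assumed UE severity of 0x00462030 (the logs do
not show the severity register), the function returns 0x00001000, which is
the Poisoned TLP bit. A handler following this approach would report and
clear those UE bits together with the CE bit.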



