Hi Pali,

Can you please indicate which HW source (e.g. which register bits in the PCIe controller) is translated into the fatal abort?

Regarding the Erratum: to the best of my understanding, if the End Point is posting a PCIe write to the host while the host is reading the End Point registers via PCIe, completion data is generated. With the strong-ordered mode, one would expect the posted write to finish first, and only then the completion data to be processed. The Erratum means that this mode is not supported. DIS_ORD_CHK must be set to disable this feature, which is not supported by the HW.

Regarding Bjorn's comment: leaving this bit clear will not help, as the strong-order feature is not implemented in HW. Leaving the bit clear will not make the HW enforce strong ordering. There is no detailed description of the HW behavior when the bit is left clear (the default), but as is evident from the Erratum and from your own experience, leaving it clear does not produce the correct, expected behavior from the HW, which is why it must be set for correct functionality of the system (both hardware and software).

Regarding the patch: I would also add a full memory barrier, in order to minimize the risk of the race condition documented in the Erratum between reading the DMA done status and the completion of the write to host memory. If you use interrupts on the host to handle the write completion, the barrier belongs in the PCIe driver interrupt handler; otherwise this will require modifying the specific WiFi driver. This of course does not guarantee ordering, but it is better than leaving it the way it is.

Hopefully this helps,

Elad.
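For illustration only, here is a minimal user-space sketch of the two steps discussed above: setting DIS_ORD_CHK, and placing a full barrier between reading the DMA done status and reading the data the End Point wrote. Everything except the name DIS_ORD_CHK is an assumption made for the sketch: the bit position, `struct fake_dev`, and both function names are hypothetical, and `atomic_thread_fence()` merely stands in for what would presumably be `mb()` in a kernel driver. As noted above, the barrier narrows the race window but does not guarantee ordering on this HW.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Placeholder bit position: an assumption, not taken from the erratum. */
#define DIS_ORD_CHK (UINT32_C(1) << 30)

/* Hypothetical stand-in for the controller and the endpoint state. */
struct fake_dev {
	uint32_t ctrl_reg;         /* register assumed to hold DIS_ORD_CHK */
	_Atomic uint32_t dma_done; /* endpoint "DMA done" status */
	uint8_t rx_buf[64];        /* host memory written by the endpoint */
};

/* Set DIS_ORD_CHK, since the ordering check is not implemented in HW. */
static void disable_ordering_check(struct fake_dev *dev)
{
	assert(dev != NULL);
	dev->ctrl_reg |= DIS_ORD_CHK;
}

/*
 * Shaped like an interrupt handler: read the status, issue a full
 * barrier, and only then touch the buffer the endpoint wrote.
 * Returns true if data was copied to 'out'.
 */
static bool handle_dma_done_irq(struct fake_dev *dev, uint8_t *out, size_t len)
{
	assert(dev != NULL && out != NULL);

	if (len > sizeof(dev->rx_buf))
		return false;

	if (atomic_load_explicit(&dev->dma_done, memory_order_relaxed) == 0)
		return false; /* nothing completed yet */

	/* Full barrier between the status read and the data read. */
	atomic_thread_fence(memory_order_seq_cst);

	memcpy(out, dev->rx_buf, len);

	/* Acknowledge the completion so the next one can be detected. */
	atomic_store_explicit(&dev->dma_done, 0, memory_order_relaxed);
	return true;
}
```

A caller would run `disable_ordering_check()` once at controller setup and `handle_dma_done_irq()` from the completion interrupt. If the WiFi driver polls the status itself instead of using the host interrupt, the same barrier would have to go into that driver's polling path.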
-----Original Message-----
From: Pali Rohár <pali@xxxxxxxxxx>
Sent: Sunday, July 10, 2022 2:21 PM
To: Elad Nachman <enachman@xxxxxxxxxxx>; Ratheesh Kannoth <rkannoth@xxxxxxxxxxx>; Tanmay Jagdale <tanmay@xxxxxxxxxxx>; Shijith Thotton <sthotton@xxxxxxxxxxx>; Arun Easi <aeasi@xxxxxxxxxxx>
Cc: Krzysztof Wilczyński <kw@xxxxxxxxx>; Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>; Thomas Petazzoni <thomas.petazzoni@xxxxxxxxxxx>; Bjorn Helgaas <bhelgaas@xxxxxxxxxx>; Marek Behún <kabel@xxxxxxxxxx>; Remi Pommarel <repk@xxxxxxxxxxxx>; Xogium <contact@xxxxxxxxx>; linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; Bharat Bhushan <bbhushan2@xxxxxxxxxxx>; Veerasenareddy Burru <vburru@xxxxxxxxxxx>; Wojciech Bartczak <wbartczak@xxxxxxxxxxx>
Subject: [EXT] Re: Issues with A3720 PCIe controller driver pci-aardvark.c

+ Other people from Marvell active on LKML.

Could you please look at this issue and give us some comment? It is really critical issue which needs to be solved.

On Wednesday 16 February 2022 21:09:40 Pali Rohár wrote:
> + Bharat, Veerasenareddy and Wojciech from Marvell
>
> Hello! Could you please look at this email and help us with this Marvell HW issue?
>
> On Saturday 24 July 2021 00:17:10 Pali Rohár wrote:
> > Hello Konstantin!
> >
> > There are issues with Marvell Armada 3720 PCIe controller when high
> > performance PCIe card (e.g. WiFi AX) is connected to this SOC. Under
> > heavy load PCIe controller sends fatal abort to CPU and kernel crash.
> >
> > In Marvell Armada 3700 Functional Errata, Guidelines, and
> > Restrictions document is described erratum 3.12 PCIe Completion
> > Timeout (Ref #: 251) which may be relevant. But neither Bjorn,
> > Thomas nor me were able to understood text of this erratum. And we
> > have already spent lot of time on this erratum. My guess that is
> > that in erratum itself are mistakes and there are missing some other
> > important details.
> >
> > Konstantin, are you able to understand this erratum? Or do you know
> > somebody in Marvell who understand this erratum and can explain
> > details to us? Or do you know some more details about this erratum?
> >
> > Also it would be useful if you / Marvell could share text of this
> > erratum with linux-pci people as currently it is available only on
> > Marvell Customer Portal which requires registration with signed NDA.
> >
> > In past Thomas wrote patch "according to this erratum" and I have
> > rebased, rewritten and resent it to linux-pci mailing list for review:
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__lore.kernel.org_linux-2Dpci_20210624222621.4776-2D6-2Dpali-40kernel.org_&d=DwIDaQ&c=nKjWec2b6R0mOyPaz7xtfQ&r=eTeNTLEK5-TxXczjOcKPhANIFtlB9pP4lq9qhdlFrwQ&m=opYsQsv_sfSvTtA5oJwc1paZrPAWMHVhTx_9J1VWBVDksBETCXVsC3rRDb5ejgg-&s=AKbEBWOIxa4A0QSFXiq6HhKpByn0hPJuZvbxsu3m8oo&e=
> >
> > Similar patch is available also in kernel which is part of Marvell SDK.
> >
> > Bjorn has objections for this patch as he thinks that bit
> > DIS_ORD_CHK in that patch should be disabled. Seems that enabling
> > this bit effectively disables PCIe strong ordering model. PCIe
> > kernel drivers rely on PCIe strong ordering, so it would implicate
> > that that bit should not be enabled. Which is opposite of what is
> > mentioned patch doing.
> >
> > Konstantin, could you help us with this problem?