-----Original Message-----
From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-owner@xxxxxxxxxxxxxxx] On Behalf Of Bjorn Helgaas
Sent: Wednesday, April 10, 2019 3:30 PM
To: Dalessandro, Dennis <dennis.dalessandro@xxxxxxxxx>
Cc: jgg@xxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; Ruhl, Michael J <michael.j.ruhl@xxxxxxxxx>; dledford@xxxxxxxxxx; Arumugam, Kamenee <kamenee.arumugam@xxxxxxxxx>
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

Hi Bjorn,

> I know there are a few drivers that fiddle with AER bits, but that makes me a little bit nervous because error handling is more than just a
> driver issue. It involves the PCI core and the platform firmware as well.

> Anyway, let's figure out more about this particular case. Unsupported
> Request is a PCIe protocol-level issue. You're masking it in the HFI adapter, which I guess means you want to prevent it from reporting UR.

> So the HFI is receiving a TLP that it doesn't support?

Yes, the HFI is receiving a TLP that it flags as an Unsupported Request.

> What exactly is causing the UR? Is it something the driver could potentially avoid, e.g., an AtomicOp that HFI doesn't support? I have a
> vague notion that InfiniBand allows some sort of direct user-space access to hardware; is there something there that can cause a UR?

The HFI PCIe BARs are mapped to user space to implement kernel bypass for MPI/PSM jobs. In this case, a user-level application is making spurious read accesses (accesses of an invalid width) to this memory mapping, which causes the device to report an Unsupported Request error through AER. The spurious read accesses may be due to errant application behavior (e.g. reading beyond the end of an array).

> The system hang sounds like a separate problem that should also be fixed. Even if HFI signals a UR error, I would not expect a system
> hang.

We haven't root-caused the system hang, but it doesn't appear to be related to our driver.

>> Set Unsupported Request Error bit in Uncorrectable Error Mask register
>> to disable error reporting to the PCIe root complex.
>>
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
>> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@xxxxxxxxx>
>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>
>> ---
>>  drivers/infiniband/hw/hfi1/pcie.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
>> index c96d193..a033e28 100644
>> --- a/drivers/infiniband/hw/hfi1/pcie.c
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>>          }
>>
>>          pci_set_master(pdev);
>> +        pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>>          (void)pci_enable_pcie_error_reporting(pdev);
>>          return 0;
>>
>>
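
For anyone following the thread without the hardware at hand, the effect of the one-line change can be modeled in user space. The sketch below is not driver code. It assumes that pcie_aer_set_dword() ORs the given bits into the addressed AER register (its body is not shown in this message), and it replaces the device's config space with a plain array. The names fake_aer_cap, fake_aer_set_dword and fake_aer_is_masked are made up for illustration; the two constants carry the values from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as found in include/uapi/linux/pci_regs.h (the 'u' suffix is added here). */
#define PCI_ERR_UNCOR_MASK 0x08        /* Uncorrectable Error Mask, byte offset in the AER capability */
#define PCI_ERR_UNC_UNSUP  0x00100000u /* Unsupported Request Error bit */

#define FAKE_AER_DWORDS 16

/* Hypothetical stand-in for a device's AER capability registers. */
struct fake_aer_cap {
    uint32_t regs[FAKE_AER_DWORDS];
};

/*
 * Assumed semantics of the helper used in the patch: read the dword at
 * 'offset', OR in 'bits', and write it back. Other bits are preserved.
 */
static void fake_aer_set_dword(struct fake_aer_cap *cap, unsigned int offset,
                               uint32_t bits)
{
    assert(cap != NULL);
    assert(offset % 4 == 0 && offset / 4 < FAKE_AER_DWORDS);
    cap->regs[offset / 4] |= bits;
}

/* Returns nonzero if 'bit' is set in the Uncorrectable Error Mask register. */
static int fake_aer_is_masked(const struct fake_aer_cap *cap, uint32_t bit)
{
    assert(cap != NULL);
    return (cap->regs[PCI_ERR_UNCOR_MASK / 4] & bit) != 0;
}
```

Calling fake_aer_set_dword(&cap, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP) sets only bit 20 of the mask register and leaves any previously masked errors untouched. On real hardware a set mask bit means the device no longer sends an error message to the root complex for that error type, which is the behavior the commit message describes.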