Hi Dennis,

On Wed, Apr 10, 2019 at 05:35:01AM -0700, Dennis Dalessandro wrote:
> From: Kamenee Arumugam <kamenee.arumugam@xxxxxxxxx>
>
> For hfi1, the unsupported request error is not considered a fatal
> error. When the PCIe advanced error reporting capability (AER) is
> configured to report unsupported requests as fatal, the system will
> hang on this error.

I know there are a few drivers that fiddle with AER bits, but that
makes me a little bit nervous because error handling is more than just
a driver issue. It involves the PCI core and the platform firmware as
well. Anyway, let's figure out more about this particular case.

Unsupported Request is a PCIe protocol-level issue. You're masking it
in the HFI adapter, which I guess means you want to prevent it from
reporting UR. So the HFI is receiving a TLP that it doesn't support?

What exactly is causing the UR? Is it something the driver could
potentially avoid, e.g., an AtomicOp that HFI doesn't support? I have
a vague notion that InfiniBand allows some sort of direct user-space
access to hardware; is there something there that can cause a UR?

The system hang sounds like a separate problem that should also be
fixed. Even if HFI signals a UR error, I would not expect a system
hang.

Bjorn

> Set Unsupported Request Error bit in Uncorrectable Error Mask
> register to disable error reporting to the PCIe root complex.
>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@xxxxxxxxx>
> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@xxxxxxxxx>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@xxxxxxxxx>
> ---
>  drivers/infiniband/hw/hfi1/pcie.c | 1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index c96d193..a033e28 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>  	}
>
>  	pci_set_master(pdev);
> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>  	(void)pci_enable_pcie_error_reporting(pdev);
>  	return 0;
>
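
For reference, here is a rough user-space sketch of what setting that
mask bit amounts to at the register level. It is only a model: struct
fake_cfg, cfg_read32(), cfg_write32() and aer_set_bits() are
hypothetical names made up for illustration, and the sketch does not
claim to show how the helper used in the patch is implemented. The two
constants carry the values Linux defines in pci_regs.h. The AER
capability offset is an assumption supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in Linux's pci_regs.h. */
#define PCI_ERR_UNCOR_MASK 0x08       /* Uncorrectable Error Mask, offset within the AER capability */
#define PCI_ERR_UNC_UNSUP  0x00100000 /* Unsupported Request Error, bit 20 */

/* Hypothetical stand-in for one device's config space (not a kernel type). */
struct fake_cfg {
	uint8_t bytes[4096];
	uint16_t aer_pos; /* assumed offset of the AER extended capability */
};

/* Read a 32-bit register from the modelled config space. */
static uint32_t cfg_read32(const struct fake_cfg *c, uint16_t off)
{
	uint32_t v;

	assert((size_t)off + sizeof(v) <= sizeof(c->bytes));
	memcpy(&v, &c->bytes[off], sizeof(v));
	return v;
}

/* Write a 32-bit register in the modelled config space. */
static void cfg_write32(struct fake_cfg *c, uint16_t off, uint32_t v)
{
	assert((size_t)off + sizeof(v) <= sizeof(c->bytes));
	memcpy(&c->bytes[off], &v, sizeof(v));
}

/* Read-modify-write: set bits in an AER register, preserving the others. */
static void aer_set_bits(struct fake_cfg *c, uint16_t reg, uint32_t bits)
{
	uint16_t off = (uint16_t)(c->aer_pos + reg);

	cfg_write32(c, off, cfg_read32(c, off) | bits);
}
```

Calling aer_set_bits(&cfg, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP) on
such a model sets bit 20 of the mask register and leaves any bits that
were already masked untouched. It only stops the device from reporting
UR; it does not address why the TLP is unsupported or why the system
hangs.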