----- Original Message -----
> From: "Thomas Bogendoerfer" <tbogendoerfer@xxxxxxx>
> To: "Timothy Pearson" <tpearson@xxxxxxxxxxxxxxxxxxxxx>
> Cc: "Rudolf Gabler" <rug@xxxxxxxxxx>, "linux-rdma" <linux-rdma@xxxxxxxxxxxxxxx>
> Sent: Tuesday, December 17, 2024 2:10:42 AM
> Subject: Re: Infiniband crash

> On Mon, 16 Dec 2024 12:05:39 -0600
> tpearson@xxxxxxxxxxxxxxxxxxxxx wrote:
>
>> Did you ever find a solution for this? We're running into the same problem on a
>> highly customized aarch64 system (NXP QorIQ platform), same Infinband adapter
>> and very similar crash:
>>
>> [ 4.544159] OF: /soc/pcie@3600000: no iommu-map translation for id 0x100 on
>> (null)
>> [ 4.551873] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>> [ 4.558690] ib_mthca: Initializing 0000:01:00.0
>> [ 6.258309] ib_mthca 0000:01:00.0: HCA FW version 5.1.000 is old (5.3.000 is
>> current).
>> [ 6.266272] ib_mthca 0000:01:00.0: If you have problems, try updating your
>> HCA FW.
>> [ 6.393143] ib_mthca 0000:01:00.0 ibp1s0: renamed from ib0
>> [ 6.399038] Unable to handle kernel NULL pointer dereference at virtual
>> address 0000000000000010
>> [ 6.407865] Mem abort info:
>> [ 6.410662] ESR = 0x0000000096000004
>> [ 6.414419] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 6.419748] SET = 0, FnV = 0
>> [ 6.422806] EA = 0, S1PTW = 0
>> [ 6.425952] FSC = 0x04: level 0 translation fault
>> [ 6.430842] Data abort info:
>> [ 6.433725] ISV = 0, ISS = 0x00000004
>> [ 6.437569] CM = 0, WnR = 0
>> [ 6.440540] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008086f60000
>> [ 6.447003] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
>> [ 6.453819] Internal error: Oops: 0000000096000004 [#1] SMP
>> [ 6.459412] Modules linked in: ib_ipoib(E) ib_umad(E) rdma_ucm(E) rdma_cm(E)
>> iw_cm(E) ib_cm(E) configfs(E) ib_mthca(E) ib_uverbs(E) ib_core(E)
>> [ 6.472263] CPU: 0 PID: 100 Comm: kworker/u17:0 Tainted: G E
>> 6.1.0+ #55
>> [ 6.480297] Hardware name: Freescale Layerscape 2080a RDB Board (DT)
>> [ 6.486670] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
>> [ 6.492636] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 6.499624] pc : mthca_poll_cq+0x4f0/0x9a0 [ib_mthca]
>> [ 6.504703] lr : mthca_poll_cq+0x1e8/0x9a0 [ib_mthca]
>>
>> Since this is apparently hitting two different architectures, I suspect the
>> problem is in the driver, not the arch-specific code. I may recommend we
>> upgrade the card to work around this, but given the rarity of the hardware it's
>> not something I want to recommend tinkering with and it may or may not even
>> accept the new card in the first place.
>
> which kernel version is this ? It looks like the bug fixed with
>
> dc52aadbc184 RDMA/mthca: Fix crash when polling CQ for shared QPs
>
> Thomas.

Kernel 6.1 -- this is a custom build for the rather odd aarch64 platform in use, and v6.1 was selected due to the use of Debian Bookworm.

I can confirm applying the patch referenced above resolves the crash. Thanks for the pointer!