On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > Hi Jason, hi Leon,
> > > >
> > > > We are seeing exactly the same error reported here:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > >
> > > > I suspect it's related to
> > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@xxxxxxxxxx/
> > > >
> > > > Do you have any idea, what goes wrong?
> > >
> > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > >
> > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > report.
> > >
> > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > >
> > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > >
> > > Thanks
> > >
> > Hi,
> >
> > I tried different host with CX-3/CX-5, they all work fine. and I can
> > only reproduce on hosts with a bit old HCA:
> > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> >
> > The bug report link
> > https://bugzilla.redhat.com/show_bug.cgi?id=2014094, mentioned HCA
> > ConnectX too.
> >
> > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI
> > ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe
> > 2.0 x8 5.0GT/s In... (rev b0)
> > with the instrument, I only narrow it down to
> > 1438         port = setup_port(coredev, port_num, &attr);
> > 1439         if (IS_ERR(port)) {
> > 1440                 ret = PTR_ERR(port);
> > 1441                 pr_info("setup ports failed %d\n", ret);
> > 1442                 goto err_put;
> > 1443         }
>
> Keep going with the tracing, there are lots of allocations in there.
>
> > My guess is the ConnectX HCA may be missing some features, which leads
> > to ENOMEM, I will continue the instrument if no other hint.
>
> Since there is no memory allocation failure splat I'm guessing some
> memory allocation hit an overflow and silently failed - ie mlx4 is
> possibily setting some value to something bogus

Yes, look for the values returned from FW.

>
> Jason
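
To illustrate the guess above (an allocation that overflows and fails without any splat), here is a minimal user-space sketch. It is not the actual setup_port() code: checked_array_alloc() and setup_port_sketch() are hypothetical names, and table_len only models a length that a driver might take from firmware. It mirrors how overflow-checked array allocators such as the kernel's kcalloc() return NULL when n * size overflows, which the caller then reports as -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an overflow-checked array allocator.
 * Returns NULL, without printing anything, when n * size would not
 * fit in size_t, so the caller cannot tell it apart from a real
 * out-of-memory condition.
 */
static void *checked_array_alloc(size_t n, size_t size)
{
	assert(size != 0);	/* precondition of this sketch */

	if (n > SIZE_MAX / size)
		return NULL;	/* multiplication would overflow */

	return calloc(n, size);
}

/*
 * Hypothetical port setup step. table_len models a table length
 * reported by the device; a bogus value turns into -ENOMEM even
 * though the system has plenty of free memory.
 */
static int setup_port_sketch(size_t table_len)
{
	uint64_t *table = checked_array_alloc(table_len, sizeof(*table));

	if (!table)
		return -ENOMEM;

	free(table);
	return 0;
}
```

With a sane length such as 16 the sketch returns 0; with a bogus length such as SIZE_MAX it returns -ENOMEM with no allocation-failure warning, which is the pattern to look for while tracing the allocations under setup_port().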