On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > > Hi Jason, hi Leon,
> > > > >
> > > > > We are seeing exactly the same error reported here:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > > >
> > > > > I suspect it's related to
> > > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@xxxxxxxxxx/
> > > > >
> > > > > Do you have any idea, what goes wrong?
> > > >
> > > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > > >
> > > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > > report.
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > >
> > > > Thanks
> > > >
> > > Hi,
> > >
> > > I tried different host with CX-3/CX-5, they all work fine. and I can
> > > only reproduce on hosts with a bit old HCA:
> > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > >
> > > The bug report link
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094, mentioned HCA
> > > ConnectX too.
> > >
> > > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI
> > > ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe
> > > 2.0 x8 5.0GT/s In... (rev b0)
> > > with the instrument, I only narrow it down to
> > >         port = setup_port(coredev, port_num, &attr);
> > >         if (IS_ERR(port)) {
> > >                 ret = PTR_ERR(port);
> > >                 pr_info("setup ports failed %d\n", ret);
> > >                 goto err_put;
> > >         }
> >
> > Keep going with the tracing, there are lots of allocations in there.
> >
> > > My guess is the ConnectX HCA may be missing some features, which leads
> > > to ENOMEM, I will continue the instrument if no other hint.
> >
> > Since there is no memory allocation failure splat I'm guessing some
> > memory allocation hit an overflow and silently failed - ie mlx4 is
> > possibly setting some value to something bogus
>
> Yes, look for the values returned from FW.

Hi Leon, hi Jason,

I've found the problem: the device doesn't support per-port diag
counters, and the driver then fails the registration, which is too harsh.
I'm not sure how to fix it properly. What are your thoughts?

Thanks

[ 3426.452062] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 3426.452067] <mlx4_ib> mlx4_ib_alloc_diag_counters: #### i =1, per_port 0
// The device does not have MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT set, which leads to the allocation failure.
[ 3426.494000] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494170] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494174] ibdev ops alloc_hw_stats_port failed
[ 3426.494175] alloc_hw_stats_port failed
[ 3426.494177] setup_hw_port_stats failed, -12
[ 3426.494181] setup ports failed -12
[ 3426.494190] infiniband mlx4_0: Couldn't register device with driver model
>
> >
> > Jason
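
To make the failure mode in the log easier to discuss, here is a minimal user-space sketch of the pattern, not kernel code and not a proposed patch. Every name in it (toy_dev, toy_stats, toy_alloc_hw_port_stats, toy_register_strict, toy_register_lenient) is hypothetical; only the idea comes from the trace above: a stats-allocation callback returns NULL because the per-port capability is missing, and the caller turns that NULL into -ENOMEM (-12 on Linux) and fails the whole registration. The "lenient" variant shows one possible direction, treating a missing capability as "no per-port stats" rather than as an error. Whether that is the right fix for mlx4 or for the core is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device; NOT the real struct mlx4_ib_dev. */
struct toy_dev {
	bool per_port_diag; /* models MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT being set */
};

/* Hypothetical stand-in for a per-port stats object. */
struct toy_stats {
	int num_counters;
};

/*
 * Models a driver callback that returns NULL when the hardware has no
 * per-port counters, so "unsupported" and "out of memory" look the same
 * to the caller.
 */
struct toy_stats *toy_alloc_hw_port_stats(const struct toy_dev *dev)
{
	struct toy_stats *stats;

	if (!dev->per_port_diag)
		return NULL;

	stats = malloc(sizeof(*stats));
	if (stats)
		stats->num_counters = 4; /* arbitrary value for the sketch */
	return stats;
}

/* Strict caller: any NULL from the callback fails registration with -ENOMEM. */
int toy_register_strict(const struct toy_dev *dev, struct toy_stats **out)
{
	assert(dev != NULL && out != NULL);

	*out = toy_alloc_hw_port_stats(dev);
	if (!*out)
		return -ENOMEM;
	return 0;
}

/*
 * Lenient caller (assumption, one possible direction): a missing capability
 * means "register without per-port stats"; only a real allocation failure
 * is reported as -ENOMEM.
 */
int toy_register_lenient(const struct toy_dev *dev, struct toy_stats **out)
{
	assert(dev != NULL && out != NULL);

	*out = NULL;
	if (!dev->per_port_diag)
		return 0;

	*out = toy_alloc_hw_port_stats(dev);
	return *out ? 0 : -ENOMEM;
}
```

With a toy_dev whose per_port_diag is false, toy_register_strict() returns -ENOMEM, mirroring the "setup ports failed -12" line, while toy_register_lenient() returns 0 and leaves the stats pointer NULL. The caller owns any returned toy_stats and should free() it.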