Re: Missing infiniband network interfaces after update to 5.14/5.15


 



On Sun, Nov 14, 2021 at 8:05 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> On Fri, Nov 12, 2021 at 10:23:56AM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 12, 2021 at 09:23:04AM +0100, Jinpu Wang wrote:
> > > On Thu, Nov 11, 2021 at 12:29 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Nov 11, 2021 at 08:48:08AM +0100, Jinpu Wang wrote:
> > > > > Hi Jason, hi Leon,
> > > > >
> > > > > We are seeing exactly the same error reported here:
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094
> > > > >
> > > > > I suspect it's related to
> > > > > https://lore.kernel.org/all/cover.1623427137.git.leonro@xxxxxxxxxx/
> > > > >
> > > > > Do you have any idea, what goes wrong?
> > > >
> > > > I can't reproduce it with latest Fedora 34 RPM, which I downloaded from here
> > > > https://koji.fedoraproject.org/koji/buildinfo?buildID=1851842
> > > >
> > > > and also with kernel-5.14.7-200.fc34.x86_64 version mentioned in the bug
> > > > report.
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.7-200.fc34.x86_64 #1 SMP Wed Sep 22 14:54:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > >
> > > > [leonro@c-235-8-1-005 ~]$ uname -a
> > > > Linux c-235-8-1-005 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> > > > [leonro@c-235-8-1-005 ~]$ rdma dev
> > > > 0: ibp8s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7950 sys_image_guid 1c34:da03:0007:7953
> > > > 1: ibp9s0f0: node_type ca fw 2.42.5000 node_guid 1c34:da03:0007:7a60 sys_image_guid 1c34:da03:0007:7a63
> > > > [leonro@c-235-8-1-005 ~]$ lspci |grep nox
> > > > 08:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > > 09:00.0 Network controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
> > > >
> > > > Thanks
> > > >
> > > Hi,
> > >
> > > I tried different hosts with CX-3/CX-5 and they all work fine. I can
> > > only reproduce it on hosts with a somewhat older HCA:
> > > 03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe
> > > 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> > >
> > > The bug report link
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2014094, mentioned HCA
> > > ConnectX too.
> > >
> > > 01:00.0 InfiniBand [0c06]: Mellanox Technologies MT25408A0-FCC-GI
> > > ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe
> > > 2.0 x8 5.0GT/s In... (rev b0)
> > > With instrumentation, I have only narrowed it down to:
> > >                 port = setup_port(coredev, port_num, &attr);
> > >                 if (IS_ERR(port)) {
> > >                         ret = PTR_ERR(port);
> > >                         pr_info("setup ports failed %d\n", ret);
> > >                         goto err_put;
> > >                 }
> >
> > Keep going with the tracing, there are lots of allocations in there.
> >
> > > My guess is that the ConnectX HCA is missing some feature, which leads
> > > to ENOMEM. I will continue instrumenting if there are no other hints.
> >
> > Since there is no memory allocation failure splat I'm guessing some
> > memory allocation hit an overflow and silently failed - i.e. mlx4 is
> > possibly setting some value to something bogus
>
> Yes, look for the values returned from FW.
Hi Leon, hi Jason

I've found the problem: the device doesn't support per-port diag
counters, and the driver then fails the whole device registration,
which is too harsh.

I'm not sure how to fix it properly. What are your thoughts?

Thanks


[ 3426.452062] <mlx4_ib> mlx4_ib_add: counter index 1 for port 2 allocated 0
[ 3426.452067] <mlx4_ib> mlx4_ib_alloc_diag_counters: #### i =1, per_port 0  // MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT is not set on this device, which leads to the allocation failure.
[ 3426.494000] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494170] <mlx4_ib> mlx4_ib_alloc_hw_port_stats: mlx4_ib_alloc_hw_port_stats name null
[ 3426.494174] ibdev ops alloc_hw_stats_port failed
[ 3426.494175] alloc_hw_stats_port failed
[ 3426.494177] setup_hw_port_stats failed, -12
[ 3426.494181] setup ports failed -12
[ 3426.494190] infiniband mlx4_0: Couldn't register device with driver model
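
To make the failure path in the log easier to follow, here is a small userspace model of it. This is only a sketch, not the kernel code and not a proposed patch: `struct fake_dev`, `alloc_port_stats`, `setup_ports_strict` and `setup_ports_lenient` are hypothetical names made up for illustration. The only things taken from the log are the idea that the per-port stats allocation returns NULL when MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT is not set, and that the caller turns that NULL into -ENOMEM (the -12 seen above on Linux), which aborts the registration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-port statistics object. */
struct port_stats {
	int num_counters;
};

/* Hypothetical device model; only the capability from the log matters here. */
struct fake_dev {
	bool diag_per_port;	/* models MLX4_DEV_CAP_FLAG2_DIAG_PER_PORT */
	int num_ports;		/* at most 2 in this sketch */
};

static struct port_stats stats_storage[2];

/*
 * Models the driver callback: with no per-port diag counters there is
 * nothing to expose, so it returns NULL, as the "name null" lines suggest.
 */
static struct port_stats *alloc_port_stats(const struct fake_dev *dev,
					   int port_num)
{
	assert(port_num >= 1 && port_num <= 2);
	if (!dev->diag_per_port)
		return NULL;
	stats_storage[port_num - 1].num_counters = 1;
	return &stats_storage[port_num - 1];
}

/* Strict caller: any NULL is reported as -ENOMEM and setup is aborted. */
static int setup_ports_strict(const struct fake_dev *dev)
{
	for (int port = 1; port <= dev->num_ports; port++) {
		if (!alloc_port_stats(dev, port))
			return -ENOMEM;
	}
	return 0;
}

/*
 * Lenient caller (an assumption about one possible direction): a port
 * without stats is skipped and setup still succeeds.
 */
static int setup_ports_lenient(const struct fake_dev *dev,
			       int *ports_with_stats)
{
	*ports_with_stats = 0;
	for (int port = 1; port <= dev->num_ports; port++) {
		if (alloc_port_stats(dev, port))
			(*ports_with_stats)++;
	}
	return 0;
}
```

With `diag_per_port` false, the strict variant returns -ENOMEM on the first port, matching the "setup ports failed -12" line, while the lenient variant succeeds with zero ports exposing stats. Note that the lenient variant in this sketch cannot tell "not supported" apart from a real out-of-memory condition; a real change would need to keep that distinction.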


>
> >
> > Jason


