Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start

On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:

> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
[...]
> > I will try to continue to bisect by testing the resulting kernels on a
> > compute node: there's no OpenSM there and it cannot run anyway, if
> > there's another OpenSM on the same InfiniBand network.
> > However, I can check whether those issm* symlinks are created in
> > /sys/class/infiniband_mad/ 
> > I really hope that this is enough to pinpoint the first bad
> > commit...
> 
> Yes, these symlinks should be there. Your test scenario is correct one.

OK, I have completed the bisect on a compute node without OpenSM, by
looking at the issm* symlinks, as I said.

See below.

> 
> > 
> > Any better ideas?
> 
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
[...]

Thanks to your guess about the possibly troublesome commit, the bisect
was completed in a few steps:

  $ git checkout 2a5db20fa532
  $ make -j 12 my_defconfig bindeb-pkg
  
  [installed this version on a compute node test image and rebooted
  one compute node with that image: the InfiniBand network was
  working on that node (no surprise, since OpenSM was running on
  the head node), but no issm* symlink was created; please note
  that, surprisingly, the Ethernet network was not working: the
  Ethernet interfaces were not found by the kernel...]
  
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
  
  $ git bisect bad
  Bisecting: 0 revisions left to test after this (roughly 0 steps)
  [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
  $ make -j 12 my_defconfig bindeb-pkg
  
  [installed this version on the compute node test image and rebooted
  one compute node with that image: the InfiniBand network was again
  working on that node, and the issm* symlinks were created;
  the Ethernet network was again not working on that node...]
  
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
  
  $ git bisect good
  2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
  commit 2a5db20fa532198639671713c6213f96ff285b85
  Author: Mark Zhang <markzhang@xxxxxxxxxx>
  Date:   Sun Jun 16 19:08:35 2024 +0300
  
      RDMA/mlx5: Add support to multi-plane device and port
  
      When multi-plane is supported, a logical port, which is aggregation of
      multiple physical plane ports, is exposed for data transmission.
      Compared with a normal mlx5 IB port, this logical port supports all
      functionalities except Subnet Management.
  
      Signed-off-by: Mark Zhang <markzhang@xxxxxxxxxx>
      Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@xxxxxxxxxx
      Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxx>
  
   drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
   drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
   drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
   include/linux/mlx5/driver.h                     |  1 +
   4 files changed, 55 insertions(+), 9 deletions(-)


In other words, bingo! Your guess looks correct: the first bad commit
is the one you mentioned.
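
For anyone who wants to repeat the check used at each bisect step, here
is a minimal sketch in POSIX shell. It assumes, based only on the
listings above, that every umadN entry in /sys/class/infiniband_mad/
should have an issmN entry with the same number. The function name
check_issm is hypothetical, it is not part of any package.

```shell
#!/bin/sh
# Sketch: report every umadN entry that lacks a matching issmN entry.
# check_issm is a hypothetical helper name, chosen for illustration.
# Assumption: umadN and issmN share the same index N, as in the
# listings above.
check_issm() {
    dir="${1:-/sys/class/infiniband_mad}"
    status=0
    for umad in "$dir"/umad*; do
        # Skip the unexpanded pattern when no umad* entry exists.
        [ -e "$umad" ] || [ -L "$umad" ] || continue
        n="${umad##*/umad}"
        if [ ! -e "$dir/issm$n" ] && [ ! -L "$dir/issm$n" ]; then
            echo "missing: issm$n (for umad$n)"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_issm with no argument inspects /sys/class/infiniband_mad;
a non-zero return value means that at least one issm* entry is missing,
which corresponds to "bad" in the bisect above. Note that it also
returns zero when no umad* entry exists at all, so it is only
meaningful on a node where the InfiniBand devices were detected.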


Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
suggested, and check whether this solves the issue with the recent
Linux kernel versions.

Please confirm that the procedure to be followed is the one described in
<https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>

Thanks for your time and patience, and for all the help you are kindly
providing!   :-)


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
