On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:

> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
[...]
> > I will try to continue to bisect by testing the resulting kernels on a
> > compute node: there's no OpenSM there and it cannot run anyway, if
> > there's another OpenSM on the same InfiniBand network.
> > However, I can check whether those issm* symlinks are created in
> > /sys/class/infiniband_mad/
> > I really hope that this is enough to pinpoint the first bad
> > commit...
>
> Yes, these symlinks should be there. Your test scenario is correct one.

OK, I have completed the bisect on a compute node without OpenSM, by
looking at the issm* symlinks, as I said. See below.

>
> >
> > Any better ideas?
>
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
[...]

Thanks to your guess about the possibly troublesome commit, the bisect
was completed in a few steps:

  $ git checkout 2a5db20fa532
  $ make -j 12 my_defconfig bindeb-pkg
  [install this version on a compute node test image and reboot
  one compute node with that image: the InfiniBand network was working
  for that node, that's no surprise, since OpenSM was running on the
  head node, but no issm* symlink was created; please note that,
  surprisingly, the Ethernet network was not working, I mean that
  the Ethernet interfaces were not found by the kernel...]
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
  $ git bisect bad
  Bisecting: 0 revisions left to test after this (roughly 0 steps)
  [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
  $ make -j 12 my_defconfig bindeb-pkg
  [install this version on the compute node test image and reboot
  one compute node with that image: the InfiniBand network again
  working for that node, issm* symlinks were created; Ethernet network
  again not working for that node...]
  root@node # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
  $ git bisect good
  2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
  commit 2a5db20fa532198639671713c6213f96ff285b85
  Author: Mark Zhang <markzhang@xxxxxxxxxx>
  Date:   Sun Jun 16 19:08:35 2024 +0300

      RDMA/mlx5: Add support to multi-plane device and port

      When multi-plane is supported, a logical port, which is aggregation of
      multiple physical plane ports, is exposed for data transmission.
      Compared with a normal mlx5 IB port, this logical port supports all
      functionalities except Subnet Management.

      Signed-off-by: Mark Zhang <markzhang@xxxxxxxxxx>
      Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@xxxxxxxxxx
      Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxx>

   drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
   drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
   drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
   include/linux/mlx5/driver.h                     |  1 +
   4 files changed, 55 insertions(+), 9 deletions(-)

In other words, bingo!, your guess looks correct, the first bad commit
is the one you mentioned.

Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
suggested, and check whether this solves the issue with the recent
Linux kernel versions.
Please confirm that the procedure to be followed is the one described
in <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>

Thanks for your time and patience, and for all the help you are kindly
providing!   :-)

--
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82 3925 3E1C 27E1 1F69 BFFE
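
The good/bad decision at each bisect step above rests only on whether
issm* entries show up in /sys/class/infiniband_mad/. A minimal sketch of
that check as a shell function follows. The function name has_issm and
its optional directory argument are hypothetical; the argument is only
there so the check can be tried against a scratch directory instead of a
real node.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): succeed if at least
# one issm* entry exists in the given directory.
# The directory defaults to /sys/class/infiniband_mad.
has_issm() {
    issm_dir="${1:-/sys/class/infiniband_mad}"
    for entry in "$issm_dir"/issm*; do
        # When the glob matches nothing, the pattern is left as a literal
        # string, and both tests below fail for it.
        # -e follows symlinks; -L also accepts a dangling symlink.
        if [ -e "$entry" ] || [ -L "$entry" ]; then
            return 0
        fi
    done
    return 1
}
```

On a freshly booted test node it could be called as
"has_issm && echo good || echo bad"; the result still has to be passed
to git bisect by hand, since every step needs a reboot into the new
kernel.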