Hello Francesco,

On Wed, Nov 27, 2024 at 10:04:13PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 27, 2024 at 06:48:03PM +0100, Francesco Poli wrote:
> > On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
> >
> > > On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> > [...]
> > > > I will try to continue to bisect by testing the resulting kernels on a
> > > > compute node: there's no OpenSM there and it cannot run anyway, if
> > > > there's another OpenSM on the same InfiniBand network.
> > > > However, I can check whether those issm* symlinks are created in
> > > > /sys/class/infiniband_mad/
> > > > I really hope that this is enough to pinpoint the first bad
> > > > commit...
> > >
> > > Yes, these symlinks should be there. Your test scenario is correct one.
> >
> > OK, I have completed the bisect on a compute node without OpenSM, by
> > looking at the issm* symlinks, as I said.
> >
> > See below.
> >
> > >
> > > >
> > > > Any better ideas?
> > >
> > > I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> > > is the one which is causing to troubles, which leads me to suspect FW.
> > [...]
> >
> > Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
> >
> >   $ git checkout 2a5db20fa532
> >   $ make -j 12 my_defconfig bindeb-pkg
> >
> >   [install this version on a compute node test image and reboot
> >   one compute node with that image: the InfiniBand network was
> >   working for that node, that's no surprise, since OpenSM was running
> >   on the head node, but no issm* symlink was created; please note
> >   that, surprisingly, the Ethernet network was not working, I mean
> >   that the Ethernet interfaces were not found by the kernel...]
> >
> >   root@node # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
> >
> >   $ git bisect bad
> >   Bisecting: 0 revisions left to test after this (roughly 0 steps)
> >   [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
> >   $ make -j 12 my_defconfig bindeb-pkg
> >
> >   [install this version on the compute node test image and reboot
> >   one compute node with that image: the InfiniBand network again
> >   working for that node, issm* symlinks were created;
> >   Ethernet network again not working for that node...]
> >
> >   root@node # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
> >
> >   $ git bisect good
> >   2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
> >   commit 2a5db20fa532198639671713c6213f96ff285b85
> >   Author: Mark Zhang <markzhang@xxxxxxxxxx>
> >   Date:   Sun Jun 16 19:08:35 2024 +0300
> >
> >       RDMA/mlx5: Add support to multi-plane device and port
> >
> >       When multi-plane is supported, a logical port, which is aggregation of
> >       multiple physical plane ports, is exposed for data transmission.
> >       Compared with a normal mlx5 IB port, this logical port supports all
> >       functionalities except Subnet Management.
> >
> >       Signed-off-by: Mark Zhang <markzhang@xxxxxxxxxx>
> >       Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@xxxxxxxxxx
> >       Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxx>
> >
> >    drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
> >    drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
> >    drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
> >    include/linux/mlx5/driver.h                     |  1 +
> >    4 files changed, 55 insertions(+), 9 deletions(-)
> >
> >
> > In other words, bingo!, your guess looks correct, the first bad commit
> > is the one you mentioned.
> >
> >
> > Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
> > suggested, and check whether this solves the issue with the recent
> > Linux kernel versions.
> >
> > Please confirm that the procedure to be followed is the one described in
> > <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>
>
> Yes, it looks correct procedure.
> If you didn't upgrade FW, this diff will achieve same result for you:
>
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index c2314797afc9..110ce177c305 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -2846,7 +2846,7 @@ static int mlx5_ib_get_plane_num(struct mlx5_core_dev *mdev, u8 *num_plane)
>         if (err)
>                 return err;
>
> -       *num_plane = vport_ctx.num_plane;
> +       *num_plane = (vport_ctx.num_plane > 1) ? vport_ctx.num_plane : 0;
>         return 0;
>  }
>
> The culprit of your issue that in some FW versions, the vport_ctx.num_plane
> was 1 and not 0 for devices which don't support that mode, while for the driver
> everything that is not 0 means supported.
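
For illustration, the effect of the one-line change in the diff above can
be sketched in user space as follows. This is only a minimal sketch, not
kernel code: normalize_num_plane() and multiplane_supported() are
hypothetical names made up for this example, and the "non-zero means
supported" check simply restates Leon's description of the driver's
behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the patched assignment: a plane count of
 * 0 or 1 reported by the firmware is collapsed to 0, larger values are
 * passed through unchanged.
 */
static uint8_t normalize_num_plane(uint8_t fw_num_plane)
{
	return (fw_num_plane > 1) ? fw_num_plane : 0;
}

/*
 * Hypothetical stand-in for the driver-side interpretation described in
 * the quoted mail: anything that is not 0 counts as "multi-plane
 * supported".
 */
static bool multiplane_supported(uint8_t num_plane)
{
	return num_plane != 0;
}
```

With a firmware that reports 1, multiplane_supported(1) is true, whereas
multiplane_supported(normalize_num_plane(1)) is false, which matches the
intent of the patch as I understand it.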

I wonder if you could test a firmware upgrade or the above patch.

Would be nice to know if there are still some things to do for us
(= Debian kernel team) here. If everything is fine for you, I'd like
to close this bug.

Best regards

Uwe