Hello Francesco,

[For the newcomers: This is about a regression in 6.11. Details are
available at https://bugs.debian.org/1086520. The TL;DR is that opensm
works as expected on 6.10.11, while it fails to start on 6.11.7.]

On Mon, Nov 18, 2024 at 08:06:16PM +0100, Francesco Poli wrote:
> On Mon, 18 Nov 2024 09:58:03 +0100 Uwe Kleine-König wrote:
>
> [...]
> > On Wed, Nov 13, 2024 at 11:15:03PM +0100, Francesco Poli wrote:
> > > On Mon, 11 Nov 2024 11:22:26 +0100 Uwe Kleine-König wrote:
> [...]
> > > > I guess the kernel provides a directory "/sys/class/infiniband_mad". Do
> > > > its contents look different on 6.10.x and 6.11.x?
> > >
> > > I will look into this as soon as I can reboot the cluster head node.
>
> I looked into this, while testing the new Debian Linux kernel that has
> just migrated to testing (which, once again, makes opensm fail to
> start, just like other 6.11.x versions).
>
> With a working kernel:
>
>   $ uname -v
>   #1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1 (2024-09-22)
>   $ ls -altrF /sys/class/infiniband_mad/
>   total 0
>   lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
>   lrwxrwxrwx  1 root root    0 Nov  4 15:58 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
>   lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/issm1/
>   lrwxrwxrwx  1 root root    0 Nov 11 15:54 issm0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/issm0/
>   drwxr-xr-x  2 root root    0 Nov 11 15:54 ./
>   drwxr-xr-x 72 root root    0 Nov 11 15:54 ../
>   -r--r--r--  1 root root 4096 Nov 11 15:54 abi_version
>   $ cat /sys/class/infiniband_mad/abi_version
>   5
>
> With a kernel that makes opensm fail to start:
>
>   $ uname -v
>   #1 SMP PREEMPT_DYNAMIC Debian 6.11.7-1 (2024-11-09)
>   $ ls -altrF /sys/class/infiniband_mad/
>   total 0
>   drwxr-xr-x 73 root root    0 Nov 18 09:41 ../
>   -r--r--r--  1 root root 4096 Nov 18 09:41 abi_version
>   lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad0 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.0/infiniband_mad/umad0/
>   lrwxrwxrwx  1 root root    0 Nov 18 09:41 umad1 -> ../../devices/pci0000:80/0000:80:01.1/0000:81:00.1/infiniband_mad/umad1/
>   drwxr-xr-x  2 root root    0 Nov 18 09:43 ./
>   $ cat /sys/class/infiniband_mad/abi_version
>   5
>
> As you can see, a couple of files (symlinks) are missing here...

It looks like the commit that is biting you is

	https://git.kernel.org/linus/50660c5197f52b8137e223dc3ba8d43661179a1d

So if you bisect, try 50660c5197f52b8137e223dc3ba8d43661179a1d and its
parent 24943dcdc156cf294d97a36bf5c51168bf574c22 first.

I don't know about infiniband, but I'd say: Either your machine doesn't
have these issmX devices and opensm should cope with that, or these
issmX devices are available and then
50660c5197f52b8137e223dc3ba8d43661179a1d is buggy.

> Does this ring a bell?

It doesn't for me, but maybe Mark Zhang or someone else among the new
recipients has an idea?

Best regards
Uwe
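
PS: In case it helps when comparing kernels during a bisection, here is
a minimal sketch that reports which umadN entries have no issmN sibling
in /sys/class/infiniband_mad. It is not taken from opensm or the kernel;
the helper name missing_issm is made up for illustration, and the
assumption that every umadN normally comes with an issmN of the same
number is only inferred from the 6.10.11 listing quoted above.

```python
import os
import re

# Directory taken from the listings quoted in this thread.
MAD_CLASS_DIR = "/sys/class/infiniband_mad"


def missing_issm(entries):
    """Return the issmN names that are absent although umadN exists.

    Assumption (inferred from the 6.10.11 listing, not from any spec):
    each umadN entry is expected to have an issmN entry with the same N.
    """
    names = set(entries)
    missing = []
    for name in names:
        match = re.fullmatch(r"umad(\d+)", name)
        if match and f"issm{match.group(1)}" not in names:
            missing.append(f"issm{match.group(1)}")
    # Sort numerically so that issm2 comes before issm10.
    return sorted(missing, key=lambda n: int(n[len("issm"):]))


if __name__ == "__main__":
    try:
        found = os.listdir(MAD_CLASS_DIR)
    except FileNotFoundError:
        print(f"{MAD_CLASS_DIR} does not exist")
    else:
        gone = missing_issm(found)
        print("missing:", ", ".join(gone) if gone else "none")
```

Run on each kernel under test, it should print "missing: none" where
both kinds of entries are present and list the absent issmN names
otherwise, which gives a quick good/bad signal per bisection step
without having to start opensm.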