Hello

Apologies up front: this is a long message because it includes supporting data. With CX3 and mlx4 I have no issues with this working, but Dell and Red Hat see issues with CX6 adapters. I cannot see what I am doing wrong, as the identical test works with CX3. I get this sequence: the keep-alive timeout (KATO) expires and the controller is torn down.

Target

[ 162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[ 162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[ 304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
[ 315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 315.066667] nvmet: ctrl 1 fatal error occurred!
[ 320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333

Initiator

[root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
no controller found: failed to write to nvme-fabrics device

[ 270.946125] nvme nvme4: creating 80 I/O queues.
[ 286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[ 286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
[ 286.555181] nvme nvme4: failed to connect queue: 1 ret=16770

So that is the TL;DR; the gory details follow.

Supporting Data
----------------

Working setup: the kernel has commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76 applied:

commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76
Author: Daniel Wagner <dwagner@xxxxxxx>
Date:   Fri Jul 29 16:26:30 2022 +0200

    nvme: consider also host_iface when checking ip options

I tested with both IB RDMA and Ethernet; both work. It is currently configured for Ethernet.

Target

[root@dl580 ~]# lspci | grep -i mell
8a:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

[root@dl580 ~]# uname -a
Linux dl580 5.14.0-284.25.1.nvmefix.el9.x86_64

ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
        ether f4:52:14:86:49:41  txqueuelen 1000  (Ethernet)
        RX packets 17  bytes 5610 (5.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 852 (852.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens4d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.0.2  netmask 255.255.255.0  broadcast 10.1.0.255
        ether f4:52:14:86:49:42  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 852 (852.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@dl580 ~]# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.42.5000
        Hardware version: 1
        Node GUID: 0xf452140300864940
        System image GUID: 0xf452140300864943
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xf65214fffe864941
                Link layer: Ethernet
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xf65214fffe864942
                Link layer: Ethernet

Initiator

[root@dl380rhel9 ~]# lspci | grep -i mell
08:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Linux dl380rhel9 5.14.0-284.25.1.nvmefix.el9.x86_64

ens1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
        ether f4:52:14:67:6b:a1  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 56  bytes 9376 (9.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
ens1d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.0.1  netmask 255.255.255.0  broadcast 10.1.0.255
        ether f4:52:14:67:6b:a2  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

[root@dl380rhel9 ~]# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.42.5000
        Hardware version: 1
        Node GUID: 0xf452140300676ba0
        System image GUID: 0xf452140300676ba3
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xf65214fffe676ba1
                Link layer: Ethernet
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xf65214fffe676ba2
                Link layer: Ethernet

The test is the same one that fails in the Red Hat lab on CX6 but works on CX3. Run this script on the target, advertising on IP 10.1.0.2:

[root@dl580 ~]# cat new_start_nvme_target.sh
#!/bin/bash
modprobe nvmet
modprobe nvme-fc
mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
echo 1 > attr_allow_any_host
mkdir namespaces/1
cd namespaces/1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
cd
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 10.1.0.2 > addr_traddr
echo rdma > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell

Then on the initiator run:

modprobe nvme-fc
nvme connect -t rdma -n nqn.2023-10.org.dell -a 10.1.0.2 -s 4420
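As an optional sanity check, not part of the test itself, the port and subsystem can be verified on the target and a discovery run from the initiator before the connect (addresses and paths as in the script above):

# on the target: confirm the nvmet port and subsystem link took effect
cat /sys/kernel/config/nvmet/ports/1/addr_trtype \
    /sys/kernel/config/nvmet/ports/1/addr_traddr \
    /sys/kernel/config/nvmet/ports/1/addr_trsvcid
ls /sys/kernel/config/nvmet/ports/1/subsystems/
dmesg | grep 'nvmet_rdma: enabling port'

# on the initiator: confirm the subsystem is advertised over RDMA
nvme discover -t rdma -a 10.1.0.2 -s 4420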
Results - Red Hat LAB CX3 mlx4

Target
[ 626.630914] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[ 626.654567] nvmet_rdma: enabling port 1 (10.1.0.2:4420)
[ 685.041034] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:34333336-3530-4d32-3232-303730304a36.

Initiator
[ 696.864671] nvme nvme0: creating 24 I/O queues.
[ 697.370447] nvme nvme0: mapped 24/0/0 default/read/poll queues.
[ 697.526386] nvme nvme0: new ctrl: NQN "nqn.2023-10.org.dell", addr 10.1.0.2:4420

[root@dl380rhel9 ~]# nvme list
Node          Generic      SN                    Model   Namespace  Usage                  Format       FW Rev
------------- ------------ --------------------- ------- ---------- ---------------------- ------------ --------
/dev/nvme0n1  /dev/ng0n1   71cf88c9fd26d64268e2  Linux   1          500.11 GB / 500.11 GB  512 B + 0 B  5.14.0-2

All good.

Now Red Hat LAB with upstream 6.6 kernel
-----------------------------------------

Here is the latest upstream target config.

Linux rhel-storage-105.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP PREEMPT_DYNAMIC Wed Nov 8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

[root@rhel-storage-105 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.1012
        Hardware version: 0
        Node GUID: 0xe8ebd30300558946
        System image GUID: 0xe8ebd30300558946
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe558946
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.35.1012
        Hardware version: 0
        Node GUID: 0xe8ebd30300558947
        System image GUID: 0xe8ebd30300558946
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe558947
                Link layer: Ethernet
CA 'mlx5_2'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0x946dae0300d05002
        System image GUID: 0x946dae0300d05002
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x966daefffed05002
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0x946dae0300d05003
        System image GUID: 0x946dae0300d05002
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x966daefffed05003
                Link layer: Ethernet

Initiator config

Linux rhel-storage-103.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP PREEMPT_DYNAMIC Wed Nov 8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

I decided to disable qla2xxx from loading on both systems.

[root@rhel-storage-103 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.2004
        Hardware version: 0
        Node GUID: 0xe8ebd303003a1d0c
        System image GUID: 0xe8ebd303003a1d0c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe3a1d0c
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.2004
        Hardware version: 0
        Node GUID: 0xe8ebd303003a1d0d
        System image GUID: 0xe8ebd303003a1d0c
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0xeaebd3fffe3a1d0d
                Link layer: Ethernet
CA 'mlx5_2'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0x946dae0300d06d72
        System image GUID: 0x946dae0300d06d72
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x966daefffed06d72
                Link layer: Ethernet
CA 'mlx5_3'
        CA type: MT4125
        Number of ports: 1
        Firmware version: 22.36.1010
        Hardware version: 0
        Node GUID: 0x946dae0300d06d73
        System image GUID: 0x946dae0300d06d72
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x966daefffed06d73
                Link layer: Ethernet
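Since each box has four mlx5 ports up, purely for reference, this is how the CA/netdev carrying the 172.18.60.x addresses can be mapped (plain iproute2, nothing test-specific):

rdma link show                      # lists each mlx5_X port with the netdev it is bound to
ip -br addr show | grep 172.18.60   # shows which netdev holds the test address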
The test: the same script on the target, now advertising on 172.18.60.2.

Target

#!/bin/bash
modprobe nvmet
modprobe nvme-fc
mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
echo 1 > attr_allow_any_host
mkdir namespaces/1
cd namespaces/1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
cd
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 172.18.60.2 > addr_traddr
echo rdma > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell

[ 162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[ 162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[ 304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
[ 315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 315.066667] nvmet: ctrl 1 fatal error occurred!
[ 320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333

Initiator

It already has some local NVMe:

Node          Generic      SN            Model                       Namespace  Usage                Format       FW Rev
------------- ------------ ------------- --------------------------- ---------- -------------------- ------------ --------
/dev/nvme3n1  /dev/ng3n1   72F0A021TC88  Dell Ent NVMe CM6 MU 1.6TB  1          2.14 GB / 1.60 TB    512 B + 0 B  2.1.8
/dev/nvme2n1  /dev/ng2n1   72F0A02CTC88  Dell Ent NVMe CM6 MU 1.6TB  1          2.27 MB / 1.60 TB    512 B + 0 B  2.1.8
/dev/nvme1n1  /dev/ng1n1   72F0A01DTC88  Dell Ent NVMe CM6 MU 1.6TB  1          544.21 MB / 1.60 TB  512 B + 0 B  2.1.8
/dev/nvme0n1  /dev/ng0n1   72F0A019TC88  Dell Ent NVMe CM6 MU 1.6TB  1          33.77 GB / 1.60 TB   512 B + 0 B  2.1.8

[root@rhel-storage-103 ~]# modprobe nvme-fc
[root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
no controller found: failed to write to nvme-fabrics device

[ 270.946125] nvme nvme4: creating 80 I/O queues.
[ 286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[ 286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
[ 286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
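For what it is worth, decoding that last return code (my reading of the host fabrics code, so please correct me if I am off):

printf '0x%x\n' 16770        # -> 0x4182
# 0x4000 is the DNR bit
# 0x182  is NVME_SC_CONNECT_INVALID_PARAM ("Connect Invalid Parameters"),
#        which the host prints as "Connect Invalid Data Parameter"

So the target is rejecting queue 1's Connect, which would line up with the target-side sequence above: the keep-alive timer (5 seconds) expired and the controller was torn down while the host was still setting up its 80 I/O queues, so by the time the I/O queue Connect arrived the target could no longer find controller 1.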