On Wed, 2023-11-08 at 14:02 -0500, Laurence Oberman wrote:
> Hello
> 
> Long message as it has supporting data, so apologies up front.
> With CX3 and mlx4 I have no issues with this working, but Dell and
> Red Hat see issues with CX6 adapters.
> 
> I cannot see what I am doing wrong, as the identical test works with
> CX3.
> 
> I get this sequence: I see the kato expire and the controller is
> torn down.
> 
> Target
> [ 162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [ 162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
> [ 304.742924] nvmet: creating nvm controller 1 for subsystem
> nqn.2023-10.org.dell for NQN
> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
> [ 315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [ 315.066667] nvmet: ctrl 1 fatal error occurred!
> [ 320.344443] nvmet: could not find controller 1 for subsys
> nqn.2023-10.org.dell / host
> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333
> 
> Initiator
> 
> [root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
> 
> no controller found: failed to write to nvme-fabrics device
> 
> [ 270.946125] nvme nvme4: creating 80 I/O queues.
> [ 286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
> [ 286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> [ 286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
> 
> So TL;DR, but here are the gory details.
> 
> Supporting Data
> ---------------
> Working setup: the kernel is a build that includes
> 
> commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76
> Author: Daniel Wagner <dwagner@xxxxxxx>
> Date:   Fri Jul 29 16:26:30 2022 +0200
> 
>     nvme: consider also host_iface when checking ip options
> 
> I tested with both IB RDMA and Ethernet; both work.
> Currently configured for Ethernet.
> 
> Target
> 
> [root@dl580 ~]# lspci | grep -i mell
> 8a:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> 
> [root@dl580 ~]# uname -a
> Linux dl580 5.14.0-284.25.1.nvmefix.el9.x86_64
> 
> ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
>         ether f4:52:14:86:49:41  txqueuelen 1000  (Ethernet)
>         RX packets 17  bytes 5610 (5.4 KiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 8  bytes 852 (852.0 B)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
> 
> ens4d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.1.0.2  netmask 255.255.255.0  broadcast 10.1.0.255
>         ether f4:52:14:86:49:42  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 8  bytes 852 (852.0 B)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
> 
> [root@dl580 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 2
>         Firmware version: 2.42.5000
>         Hardware version: 1
>         Node GUID: 0xf452140300864940
>         System image GUID: 0xf452140300864943
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe864941
>                 Link layer: Ethernet
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe864942
>                 Link layer: Ethernet
> 
> Initiator
> 
> [root@dl380rhel9 ~]# lspci | grep -i mell
> 08:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> 
> Linux dl380rhel9 5.14.0-284.25.1.nvmefix.el9.x86_64
> 
> ens1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
>         ether f4:52:14:67:6b:a1  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 56  bytes 9376 (9.1 KiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
> 
> ens1d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.1.0.1  netmask 255.255.255.0  broadcast 10.1.0.255
>         ether f4:52:14:67:6b:a2  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 0  bytes 0 (0.0 B)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
> 
> [root@dl380rhel9 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 2
>         Firmware version: 2.42.5000
>         Hardware version: 1
>         Node GUID: 0xf452140300676ba0
>         System image GUID: 0xf452140300676ba3
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe676ba1
>                 Link layer: Ethernet
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe676ba2
>                 Link layer: Ethernet
> 
> The test is the same test that is failing in the Red Hat lab on CX6
> but working on CX3.
> 
> Run this script on the target, advertising on IP 10.1.0.2:
> 
> [root@dl580 ~]# cat new_start_nvme_target.sh
> #!/bin/bash
> modprobe nvmet
> modprobe nvme-fc
> mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> echo 1 > attr_allow_any_host
> mkdir namespaces/1
> cd namespaces/1
> echo -n /dev/nvme0n1 > device_path
> echo 1 > enable
> cd
> mkdir /sys/kernel/config/nvmet/ports/1
> cd /sys/kernel/config/nvmet/ports/1
> echo 10.1.0.2 > addr_traddr
> echo rdma > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
> 
> On the initiator, run:
> 
> modprobe nvme-fc
> nvme connect -t rdma -n nqn.2023-10.org.dell -a 10.1.0.2 -s 4420
> 
> Results - Red Hat LAB CX3 mlx4
> 
> Target
> [ 626.630914] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [ 626.654567] nvmet_rdma: enabling port 1 (10.1.0.2:4420)
> [ 685.041034] nvmet: creating nvm controller 1 for subsystem
> nqn.2023-10.org.dell for NQN
> nqn.2014-08.org.nvmexpress:uuid:34333336-3530-4d32-3232-303730304a36.
> 
> Initiator
> 
> [ 696.864671] nvme nvme0: creating 24 I/O queues.
> [ 697.370447] nvme nvme0: mapped 24/0/0 default/read/poll queues.
> [ 697.526386] nvme nvme0: new ctrl: NQN "nqn.2023-10.org.dell", addr 10.1.0.2:4420
> 
> [root@dl380rhel9 ~]# nvme list
> Node          Generic     SN                    Model  Namespace Usage                  Format       FW Rev
> ------------- ----------- --------------------- ------ --------- ---------------------- ------------ --------
> /dev/nvme0n1  /dev/ng0n1  71cf88c9fd26d64268e2  Linux  1         500.11 GB / 500.11 GB  512 B + 0 B  5.14.0-2
> 
> All good.
> 
> Now the Red Hat LAB with an upstream 6.6 kernel
> -----------------------------------------------
> 
> Here is the latest upstream.
> 
> Target config
> 
> Linux rhel-storage-105.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP
> PREEMPT_DYNAMIC Wed Nov 8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@rhel-storage-105 ~]# ibstat
> CA 'mlx5_0'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.35.1012
>         Hardware version: 0
>         Node GUID: 0xe8ebd30300558946
>         System image GUID: 0xe8ebd30300558946
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe558946
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.35.1012
>         Hardware version: 0
>         Node GUID: 0xe8ebd30300558947
>         System image GUID: 0xe8ebd30300558946
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe558947
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d05002
>         System image GUID: 0x946dae0300d05002
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed05002
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d05003
>         System image GUID: 0x946dae0300d05002
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed05003
>                 Link layer: Ethernet
> 
> Initiator config
> 
> Linux rhel-storage-103.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP
> PREEMPT_DYNAMIC Wed Nov 8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
> 
> I decided to disable qla2xxx from loading on both systems.
> 
> [root@rhel-storage-103 ~]# ibstat
> CA 'mlx5_0'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.32.2004
>         Hardware version: 0
>         Node GUID: 0xe8ebd303003a1d0c
>         System image GUID: 0xe8ebd303003a1d0c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe3a1d0c
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.32.2004
>         Hardware version: 0
>         Node GUID: 0xe8ebd303003a1d0d
>         System image GUID: 0xe8ebd303003a1d0c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe3a1d0d
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d06d72
>         System image GUID: 0x946dae0300d06d72
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed06d72
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d06d73
>         System image GUID: 0x946dae0300d06d72
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed06d73
>                 Link layer: Ethernet
> 
> Test
> 
> Target
> 
> #!/bin/bash
> modprobe nvmet
> modprobe nvme-fc
> mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> echo 1 > attr_allow_any_host
> mkdir namespaces/1
> cd namespaces/1
> echo -n /dev/nvme0n1 > device_path
> echo 1 > enable
> cd
> mkdir /sys/kernel/config/nvmet/ports/1
> cd /sys/kernel/config/nvmet/ports/1
> echo 172.18.60.2 > addr_traddr
> echo rdma > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
> 
> [ 162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [ 162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
> [ 304.742924] nvmet: creating nvm controller 1 for subsystem
> nqn.2023-10.org.dell for NQN
> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
> [ 315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [ 315.066667] nvmet: ctrl 1 fatal error occurred!
> [ 320.344443] nvmet: could not find controller 1 for subsys
> nqn.2023-10.org.dell / host
> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333
> 
> Initiator
> 
> Has some local NVMe already:
> 
> Node          Generic     SN            Model                       Namespace Usage                Format       FW Rev
> ------------- ----------- ------------- --------------------------- --------- -------------------- ------------ --------
> /dev/nvme3n1  /dev/ng3n1  72F0A021TC88  Dell Ent NVMe CM6 MU 1.6TB  1           2.14 GB / 1.60 TB  512 B + 0 B  2.1.8
> /dev/nvme2n1  /dev/ng2n1  72F0A02CTC88  Dell Ent NVMe CM6 MU 1.6TB  1           2.27 MB / 1.60 TB  512 B + 0 B  2.1.8
> /dev/nvme1n1  /dev/ng1n1  72F0A01DTC88  Dell Ent NVMe CM6 MU 1.6TB  1         544.21 MB / 1.60 TB  512 B + 0 B  2.1.8
> /dev/nvme0n1  /dev/ng0n1  72F0A019TC88  Dell Ent NVMe CM6 MU 1.6TB  1          33.77 GB / 1.60 TB  512 B + 0 B  2.1.8
> 
> [root@rhel-storage-103 ~]# modprobe nvme-fc
> [root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
> 
> no controller found: failed to write to nvme-fabrics device
> 
> [ 270.946125] nvme nvme4: creating 80 I/O queues.
> [ 286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
> [ 286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> [ 286.555181] nvme nvme4: failed to connect queue: 1 ret=16770

This patch fixes it:

diff -Nurp linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h
--- linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h	2023-07-20 08:42:08.000000000 -0400
+++ linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h	2023-11-08 14:16:37.924155469 -0500
@@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
 extern unsigned int admin_timeout;
 #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
 
-#define NVME_DEFAULT_KATO	5
+#define NVME_DEFAULT_KATO	30
 
 #ifdef CONFIG_ARCH_NO_SG_CHAIN
 #define NVME_INLINE_SG_CNT	0

Seems 5s is too short now.

Target

[ 197.644691] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[ 197.684394] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[ 203.224885] nvmet: creating nvm controller 1 for subsystem
nqn.2023-10.org.dell for NQN
nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.

Initiator

[ 171.306674] nvme nvme4: new ctrl: NQN "nqn.2023-10.org.dell", addr 172.18.60.2:4420
[ 171.308900] nvme4n1:

So I don't see another way to change the kato.
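
For what it's worth, rather than patching NVME_DEFAULT_KATO it may also be
possible to override the keep-alive per connection from userspace: nvme-cli's
connect command accepts --keep-alive-tmo/-k. A sketch, untested here, reusing
the same subsystem and address as above:

  # ask for a 30s keep-alive at Connect time instead of the built-in 5s default
  nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 \
      --keep-alive-tmo=30

That only changes what one host requests per connection, so it would be a
workaround; it does not explain why the 5s default now expires on CX6.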