Re: Mellanox CX6 and nvmet connectivity failure, happens on RHEL9.2 kernels and latest 6.6 upstream

On Wed, 2023-11-08 at 14:02 -0500, Laurence Oberman wrote:
> Hello
> 
> Long message as it has supporting data, so apologies up front.
> With CX3 and mlx4 I have no issues with this working, but Dell and
> Red Hat see issues with CX6 adapters.
> 
> I cannot see what I am doing wrong, as the identical test works
> with CX3.
> 
> 
> I get this sequence: I see the KATO expire and the controller get
> torn down.
> 
> Target
> [  162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [  162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
> [  304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
> [  315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [  315.066667] nvmet: ctrl 1 fatal error occurred!
> [  320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333
> 
> Initiator
> 
> [root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
> 
> no controller found: failed to write to nvme-fabrics device
> 
> [  270.946125] nvme nvme4: creating 80 I/O queues.
> [  286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
> [  286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> [  286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
> 
> It is a bit TL;DR, but here are the gory details.
> 
> Supporting Data
> ----------------
> The working setup runs a kernel that includes
> 
> commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76
> Author: Daniel Wagner <dwagner@xxxxxxx>
> Date:   Fri Jul 29 16:26:30 2022 +0200
> 
>     nvme: consider also host_iface when checking ip options
> 
> 
> I tested with both IB RDMA and Ethernet; both work.
> The setup is currently configured for Ethernet.
> 
> 
> Target
> 
> [root@dl580 ~]# lspci | grep -i mell
> 8a:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> 
> [root@dl580 ~]# uname -a
> Linux dl580 5.14.0-284.25.1.nvmefix.el9.x86_64 
> 
> ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
>         ether f4:52:14:86:49:41  txqueuelen 1000  (Ethernet)
>         RX packets 17  bytes 5610 (5.4 KiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 8  bytes 852 (852.0 B)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> ens4d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.1.0.2  netmask 255.255.255.0  broadcast 10.1.0.255
>         ether f4:52:14:86:49:42  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 8  bytes 852 (852.0 B)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> 
> [root@dl580 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 2
>         Firmware version: 2.42.5000
>         Hardware version: 1
>         Node GUID: 0xf452140300864940
>         System image GUID: 0xf452140300864943
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe864941
>                 Link layer: Ethernet
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe864942
>                 Link layer: Ethernet
> 
> 
> Initiator
> 
> [root@dl380rhel9 ~]# lspci | grep -i mell
> 08:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> 
> Linux dl380rhel9 5.14.0-284.25.1.nvmefix.el9.x86_64
> 
> ens1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
>         ether f4:52:14:67:6b:a1  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 56  bytes 9376 (9.1 KiB)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> ens1d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.1.0.1  netmask 255.255.255.0  broadcast 10.1.0.255
>         ether f4:52:14:67:6b:a2  txqueuelen 1000  (Ethernet)
>         RX packets 0  bytes 0 (0.0 B)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 0  bytes 0 (0.0 B)
>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> 
> [root@dl380rhel9 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT4099
>         Number of ports: 2
>         Firmware version: 2.42.5000
>         Hardware version: 1
>         Node GUID: 0xf452140300676ba0
>         System image GUID: 0xf452140300676ba3
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe676ba1
>                 Link layer: Ethernet
>         Port 2:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 10
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xf65214fffe676ba2
>                 Link layer: Ethernet
> 
> 
> The test is the same one that fails in the Red Hat lab on CX6 but
> works on CX3.
> 
> 
> Run this script on the target, advertising on IP 10.1.0.2:
> 
> [root@dl580 ~]# cat new_start_nvme_target.sh 
> #!/bin/bash
> modprobe nvmet
> modprobe nvme-fc
> mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> echo 1 > attr_allow_any_host
> mkdir namespaces/1
> cd namespaces/1
> echo -n /dev/nvme0n1 > device_path
> echo 1 > enable
> cd
> mkdir /sys/kernel/config/nvmet/ports/1
> cd /sys/kernel/config/nvmet/ports/1
> echo 10.1.0.2 > addr_traddr
> echo rdma > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/
> /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
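> 
> To sanity-check the port before connecting, the configfs attributes
> can be read back; assuming the paths above, something like:
> 
> grep -H . /sys/kernel/config/nvmet/ports/1/addr_*
> ls /sys/kernel/config/nvmet/ports/1/subsystems/
> 
> should show 10.1.0.2/rdma/4420 and the linked subsystem.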
> 
> 
> On the initiator run:
> 
> modprobe nvme-fc
> nvme connect -t rdma -n nqn.2023-10.org.dell -a 10.1.0.2 -s 4420
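> 
> To tear down between runs, disconnecting by NQN should do it:
> 
> nvme disconnect -n nqn.2023-10.org.dell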
> 
> 
> 
> Results - Red Hat LAB CX3 mlx4
> 
> Target
> [  626.630914] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [  626.654567] nvmet_rdma: enabling port 1 (10.1.0.2:4420)
> [  685.041034] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:34333336-3530-4d32-3232-303730304a36.
> 
> Initiator
> 
> [  696.864671] nvme nvme0: creating 24 I/O queues.
> [  697.370447] nvme nvme0: mapped 24/0/0 default/read/poll queues.
> [  697.526386] nvme nvme0: new ctrl: NQN "nqn.2023-10.org.dell", addr 10.1.0.2:4420
> 
> [root@dl380rhel9 ~]# nvme list
> Node          Generic     SN                    Model   Namespace Usage                  Format       FW Rev
> ------------- ----------- --------------------- ------- --------- ---------------------- ------------ --------
> /dev/nvme0n1  /dev/ng0n1  71cf88c9fd26d64268e2  Linux   1         500.11 GB / 500.11 GB  512 B + 0 B  5.14.0-2
> 
> All good 
> 
> 
> Now Red Hat LAB with upstream 6.6 kernel
> -----------------------------------------
> 
> Here is the latest upstream:
> 
> 
> 
> Target config
> 
> Linux rhel-storage-105.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP
> PREEMPT_DYNAMIC Wed Nov  8 09:53:23 EST 2023 x86_64 x86_64 x86_64
> GNU/Linux
> 
> [root@rhel-storage-105 ~]# ibstat
> CA 'mlx5_0'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.35.1012
>         Hardware version: 0
>         Node GUID: 0xe8ebd30300558946
>         System image GUID: 0xe8ebd30300558946
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe558946
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.35.1012
>         Hardware version: 0
>         Node GUID: 0xe8ebd30300558947
>         System image GUID: 0xe8ebd30300558946
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe558947
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d05002
>         System image GUID: 0x946dae0300d05002
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed05002
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d05003
>         System image GUID: 0x946dae0300d05002
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed05003
>                 Link layer: Ethernet
> 
> 
> Initiator config
> 
> Linux rhel-storage-103.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP
> PREEMPT_DYNAMIC Wed Nov  8 09:53:23 EST 2023 x86_64 x86_64 x86_64
> GNU/Linux
> 
> 
> I decided to prevent qla2xxx from loading on both systems.
> 
> 
> [root@rhel-storage-103 ~]# ibstat
> CA 'mlx5_0'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.32.2004
>         Hardware version: 0
>         Node GUID: 0xe8ebd303003a1d0c
>         System image GUID: 0xe8ebd303003a1d0c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe3a1d0c
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4119
>         Number of ports: 1
>         Firmware version: 16.32.2004
>         Hardware version: 0
>         Node GUID: 0xe8ebd303003a1d0d
>         System image GUID: 0xe8ebd303003a1d0c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 25
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0xeaebd3fffe3a1d0d
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d06d72
>         System image GUID: 0x946dae0300d06d72
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed06d72
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4125
>         Number of ports: 1
>         Firmware version: 22.36.1010
>         Hardware version: 0
>         Node GUID: 0x946dae0300d06d73
>         System image GUID: 0x946dae0300d06d72
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x00010000
>                 Port GUID: 0x966daefffed06d73
>                 Link layer: Ethernet
> 
> 
> 
> Test
> 
> Target
> 
> #!/bin/bash
> modprobe nvmet
> modprobe nvme-fc
> mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
> echo 1 > attr_allow_any_host
> mkdir namespaces/1
> cd namespaces/1
> echo -n /dev/nvme0n1 > device_path
> echo 1 > enable
> cd
> mkdir /sys/kernel/config/nvmet/ports/1
> cd /sys/kernel/config/nvmet/ports/1
> echo 172.18.60.2 > addr_traddr
> echo rdma > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/
> /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
> 
> 
> 
> [  162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
> [  162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
> [  304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
> [  315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [  315.066667] nvmet: ctrl 1 fatal error occurred!
> [  320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333
> 
> 
> Initiator
> 
> It already has some local NVMe devices:
> 
> Node          Generic     SN            Model                       Namespace Usage                Format       FW Rev
> ------------- ----------- ------------- --------------------------- --------- -------------------- ------------ --------
> /dev/nvme3n1  /dev/ng3n1  72F0A021TC88  Dell Ent NVMe CM6 MU 1.6TB  1         2.14 GB / 1.60 TB    512 B + 0 B  2.1.8
> /dev/nvme2n1  /dev/ng2n1  72F0A02CTC88  Dell Ent NVMe CM6 MU 1.6TB  1         2.27 MB / 1.60 TB    512 B + 0 B  2.1.8
> /dev/nvme1n1  /dev/ng1n1  72F0A01DTC88  Dell Ent NVMe CM6 MU 1.6TB  1         544.21 MB / 1.60 TB  512 B + 0 B  2.1.8
> /dev/nvme0n1  /dev/ng0n1  72F0A019TC88  Dell Ent NVMe CM6 MU 1.6TB  1         33.77 GB / 1.60 TB   512 B + 0 B  2.1.8
> 
> [root@rhel-storage-103 ~]# modprobe nvme-fc
> [root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420
> 
> no controller found: failed to write to nvme-fabrics device
> 
> [  270.946125] nvme nvme4: creating 80 I/O queues.
> [  286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
> [  286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
> [  286.555181] nvme nvme4: failed to connect queue: 1 ret=16770


This patch fixes it:

diff -Nurp linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h
--- linux-5.14.0-284.25.1.el9_2.orig/drivers/nvme/host/nvme.h	2023-07-20 08:42:08.000000000 -0400
+++ linux-5.14.0-284.25.1.el9_2/drivers/nvme/host/nvme.h	2023-11-08 14:16:37.924155469 -0500
@@ -25,7 +25,7 @@ extern unsigned int nvme_io_timeout;
 extern unsigned int admin_timeout;
 #define NVME_ADMIN_TIMEOUT	(admin_timeout * HZ)
 
-#define NVME_DEFAULT_KATO	5
+#define NVME_DEFAULT_KATO	30
 
 #ifdef CONFIG_ARCH_NO_SG_CHAIN
 #define  NVME_INLINE_SG_CNT  0
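
As a check after rebuilding with the patch, newer kernels appear to
expose the controller's keep-alive via a kato sysfs attribute, so the
new value can be read back (assuming the fabrics controller comes up
as nvme4, as in the logs above):

cat /sys/class/nvme/nvme4/kato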


Seems 5s is too short now. From the timestamps above, the initiator
spends roughly 15 seconds just creating and mapping the 80 I/O queues
on CX6, so the target's 5-second keep-alive timer expires before the
connect sequence completes.

[  197.644691] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[  197.684394] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[  203.224885] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.


Initiator
[  171.306674] nvme nvme4: new ctrl: NQN "nqn.2023-10.org.dell", addr 172.18.60.2:4420
[  171.308900]  nvme4n1:

So I don't see another way to change the KATO.
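
That said, if I am reading the nvme-cli options right, the keep-alive
timeout can also be raised per connection with -k/--keep-alive-tmo
instead of patching the default, e.g.:

nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420 -k 30

The target should honor that too, since its timer is taken from the
kato value the host sends in the Connect command.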




