Mellanox CX6 and nvmet connectivity failure, happens on RHEL9.2 kernels and latest 6.6 upstream

Hello

Apologies up front for the long message; it includes supporting data.
With CX3 and mlx4 this works with no issues, but both Dell and Red Hat
see failures with CX6 adapters.

I cannot see what I am doing wrong, as the identical test works with
CX3.


I get the following sequence: the keep-alive timeout (KATO) expires and
the controller is torn down.

Target
[  162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[  162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[  304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
[  315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[  315.066667] nvmet: ctrl 1 fatal error occurred!
[  320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333

Initiator

[root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420

no controller found: failed to write to nvme-fabrics device

[  270.946125] nvme nvme4: creating 80 I/O queues.
[  286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[  286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
[  286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
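
For what it is worth, ret=16770 is 0x4182: the DNR bit (0x4000) plus the
Fabrics Connect status NVME_SC_CONNECT_INVALID_PARAM (0x182), which
matches the "Connect Invalid Data Parameter" message:

printf '0x%X\n' 16770   # -> 0x4182 = 0x4000 (DNR) | 0x182 (CONNECT_INVALID_PARAM)

Reading the two logs together, it looks like the admin queue connects,
the target tears the controller down when no Keep Alive arrives within
the 5 second KATO, and the later Connect for I/O queue 1 then finds no
controller 1 on the target, hence the invalid-parameter status pointing
at cntlid.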

That is the TL;DR; the gory details follow.

Supporting Data
----------------
Working setup: the kernel includes

commit 4cde03d82e2d0056d20fd5af6a264c7f5e6a3e76
Author: Daniel Wagner <dwagner@xxxxxxx>
Date:   Fri Jul 29 16:26:30 2022 +0200

    nvme: consider also host_iface when checking ip options


I tested with both IB RDMA and Ethernet; both work.
Currently configured for Ethernet.


Target

[root@dl580 ~]# lspci | grep -i mell
8a:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

[root@dl580 ~]# uname -a
Linux dl580 5.14.0-284.25.1.nvmefix.el9.x86_64 

ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
        ether f4:52:14:86:49:41  txqueuelen 1000  (Ethernet)
        RX packets 17  bytes 5610 (5.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 852 (852.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens4d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.0.2  netmask 255.255.255.0  broadcast 10.1.0.255
        ether f4:52:14:86:49:42  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8  bytes 852 (852.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


[root@dl580 ~]# ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf452140300864940
	System image GUID: 0xf452140300864943
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xf65214fffe864941
		Link layer: Ethernet
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xf65214fffe864942
		Link layer: Ethernet


Initiator

[root@dl380rhel9 ~]# lspci | grep -i mell
08:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Linux dl380rhel9 5.14.0-284.25.1.nvmefix.el9.x86_64

ens1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
        ether f4:52:14:67:6b:a1  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 56  bytes 9376 (9.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens1d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.1.0.1  netmask 255.255.255.0  broadcast 10.1.0.255
        ether f4:52:14:67:6b:a2  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


[root@dl380rhel9 ~]# ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.42.5000
	Hardware version: 1
	Node GUID: 0xf452140300676ba0
	System image GUID: 0xf452140300676ba3
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xf65214fffe676ba1
		Link layer: Ethernet
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xf65214fffe676ba2
		Link layer: Ethernet


The test below is the same one that fails in the Red Hat lab on CX6 but
works on CX3.


Run this script on the target, advertising on IP 10.1.0.2:

[root@dl580 ~]# cat new_start_nvme_target.sh 
#!/bin/bash
modprobe nvmet
modprobe nvme-fc
mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
echo 1 > attr_allow_any_host
mkdir namespaces/1
cd namespaces/1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
cd
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 10.1.0.2 > addr_traddr
echo rdma > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ \
      /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
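
A quick sanity check after the script runs (paths as set up above) is to
read the port config back and look for the nvmet_rdma log line:

cat /sys/kernel/config/nvmet/ports/1/addr_traddr   # expect 10.1.0.2
cat /sys/kernel/config/nvmet/ports/1/addr_trtype   # expect rdma
ls /sys/kernel/config/nvmet/ports/1/subsystems/    # expect nqn.2023-10.org.dell
dmesg | grep nvmet_rdma    # expect "enabling port 1 (10.1.0.2:4420)"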


On the initiator, run:

modprobe nvme-fc
nvme connect -t rdma -n nqn.2023-10.org.dell -a 10.1.0.2 -s 4420
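
Before the connect, the discovery service can also be queried to confirm
the target is reachable and exporting the subsystem, e.g.:

nvme discover -t rdma -a 10.1.0.2 -s 4420   # should list nqn.2023-10.org.dell
nvme list-subsys                            # after a successful connect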



Results - Red Hat LAB CX3 mlx4

Target
[  626.630914] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[  626.654567] nvmet_rdma: enabling port 1 (10.1.0.2:4420)
[  685.041034] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:34333336-3530-4d32-3232-303730304a36.

Initiator

[  696.864671] nvme nvme0: creating 24 I/O queues.
[  697.370447] nvme nvme0: mapped 24/0/0 default/read/poll queues.
[  697.526386] nvme nvme0: new ctrl: NQN "nqn.2023-10.org.dell", addr 10.1.0.2:4420

[root@dl380rhel9 ~]# nvme list
Node          Generic     SN                    Model  Namespace  Usage                  Format       FW Rev
------------- ----------- --------------------- ------ ---------- ---------------------- ------------ --------
/dev/nvme0n1  /dev/ng0n1  71cf88c9fd26d64268e2  Linux  1          500.11 GB / 500.11 GB  512 B + 0 B  5.14.0-2


All good 


Now the Red Hat lab with the upstream 6.6 kernel
-------------------------------------------------

Here is the latest upstream setup.



Target config

Linux rhel-storage-105.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP PREEMPT_DYNAMIC Wed Nov  8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

[root@rhel-storage-105 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.1012
	Hardware version: 0
	Node GUID: 0xe8ebd30300558946
	System image GUID: 0xe8ebd30300558946
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xeaebd3fffe558946
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.35.1012
	Hardware version: 0
	Node GUID: 0xe8ebd30300558947
	System image GUID: 0xe8ebd30300558946
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xeaebd3fffe558947
		Link layer: Ethernet
CA 'mlx5_2'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.36.1010
	Hardware version: 0
	Node GUID: 0x946dae0300d05002
	System image GUID: 0x946dae0300d05002
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x966daefffed05002
		Link layer: Ethernet
CA 'mlx5_3'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.36.1010
	Hardware version: 0
	Node GUID: 0x946dae0300d05003
	System image GUID: 0x946dae0300d05002
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x966daefffed05003
		Link layer: Ethernet


Initiator config

Linux rhel-storage-103.storage.lab.eng.bos.redhat.com 6.6.0+ #2 SMP PREEMPT_DYNAMIC Wed Nov  8 09:53:23 EST 2023 x86_64 x86_64 x86_64 GNU/Linux


I disabled qla2xxx from loading on both systems.
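
One way to do that, for reference, is a modprobe.d blacklist (the file
name is arbitrary):

echo "blacklist qla2xxx"          >  /etc/modprobe.d/qla2xxx-blacklist.conf
echo "install qla2xxx /bin/false" >> /etc/modprobe.d/qla2xxx-blacklist.conf
dracut -f   # rebuild the initramfs so the blacklist applies at boot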


[root@rhel-storage-103 ~]# ibstat
CA 'mlx5_0'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.32.2004
	Hardware version: 0
	Node GUID: 0xe8ebd303003a1d0c
	System image GUID: 0xe8ebd303003a1d0c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xeaebd3fffe3a1d0c
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.32.2004
	Hardware version: 0
	Node GUID: 0xe8ebd303003a1d0d
	System image GUID: 0xe8ebd303003a1d0c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0xeaebd3fffe3a1d0d
		Link layer: Ethernet
CA 'mlx5_2'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.36.1010
	Hardware version: 0
	Node GUID: 0x946dae0300d06d72
	System image GUID: 0x946dae0300d06d72
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x966daefffed06d72
		Link layer: Ethernet
CA 'mlx5_3'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.36.1010
	Hardware version: 0
	Node GUID: 0x946dae0300d06d73
	System image GUID: 0x946dae0300d06d72
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x966daefffed06d73
		Link layer: Ethernet



Test

Target

#!/bin/bash
modprobe nvmet
modprobe nvme-fc
mkdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
cd /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell
echo 1 > attr_allow_any_host
mkdir namespaces/1
cd namespaces/1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
cd
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo 172.18.60.2 > addr_traddr
echo rdma > addr_trtype
echo 4420 > addr_trsvcid
echo ipv4 > addr_adrfam
ln -s /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/ \
      /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
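
Between runs the target config can be torn down in reverse order,
mirroring the setup script above:

rm /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2023-10.org.dell
rmdir /sys/kernel/config/nvmet/ports/1
echo 0 > /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/namespaces/1/enable
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell/namespaces/1
rmdir /sys/kernel/config/nvmet/subsystems/nqn.2023-10.org.dell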



[  162.276501] nvmet: adding nsid 1 to subsystem nqn.2023-10.org.dell
[  162.340724] nvmet_rdma: enabling port 1 (172.18.60.2:4420)
[  304.742924] nvmet: creating nvm controller 1 for subsystem nqn.2023-10.org.dell for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333.
[  315.060743] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[  315.066667] nvmet: ctrl 1 fatal error occurred!
[  320.344443] nvmet: could not find controller 1 for subsys nqn.2023-10.org.dell / host nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0034-5310-8057-b1c04f355333


Initiator

It already has some local NVMe devices:

Node          Generic     SN            Model                       Namespace  Usage                Format       FW Rev
------------- ----------- ------------- --------------------------- ---------- -------------------- ------------ ------
/dev/nvme3n1  /dev/ng3n1  72F0A021TC88  Dell Ent NVMe CM6 MU 1.6TB  1            2.14 GB / 1.60 TB  512 B + 0 B  2.1.8
/dev/nvme2n1  /dev/ng2n1  72F0A02CTC88  Dell Ent NVMe CM6 MU 1.6TB  1            2.27 MB / 1.60 TB  512 B + 0 B  2.1.8
/dev/nvme1n1  /dev/ng1n1  72F0A01DTC88  Dell Ent NVMe CM6 MU 1.6TB  1          544.21 MB / 1.60 TB  512 B + 0 B  2.1.8
/dev/nvme0n1  /dev/ng0n1  72F0A019TC88  Dell Ent NVMe CM6 MU 1.6TB  1           33.77 GB / 1.60 TB  512 B + 0 B  2.1.8

[root@rhel-storage-103 ~]# modprobe nvme-fc
[root@rhel-storage-103 ~]# nvme connect -t rdma -n nqn.2023-10.org.dell -a 172.18.60.2 -s 4420

no controller found: failed to write to nvme-fabrics device

[  270.946125] nvme nvme4: creating 80 I/O queues.
[  286.530761] nvme nvme4: mapped 80/0/0 default/read/poll queues.
[  286.547112] nvme nvme4: Connect Invalid Data Parameter, cntlid: 1
[  286.555181] nvme nvme4: failed to connect queue: 1 ret=16770
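
If more verbose traces would help, I can re-run with dynamic debug
enabled on both sides (assuming CONFIG_DYNAMIC_DEBUG is set), e.g.:

# Target
echo 'module nvmet +p'      > /sys/kernel/debug/dynamic_debug/control
echo 'module nvmet_rdma +p' > /sys/kernel/debug/dynamic_debug/control
# Initiator
echo 'module nvme_rdma +p'    > /sys/kernel/debug/dynamic_debug/control
echo 'module nvme_fabrics +p' > /sys/kernel/debug/dynamic_debug/control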




