Re: ceph rdma network connect refused

You can try setting nofile to 102400; maybe 10000000 is too big.
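For example, something like this (only a sketch; the "ceph" user entries and the systemd override are assumptions based on the --setuser ceph in your unit file, so adjust them to your setup):

# /etc/security/limits.conf
root soft nofile 102400
root hard nofile 102400
ceph soft nofile 102400
ceph hard nofile 102400

# and for the OSD service itself (sudo systemctl edit ceph-osd@.service):
[Service]
LimitNOFILE=102400

Then run "sudo systemctl daemon-reload", restart the OSDs, and verify the limit the running daemon actually got with "cat /proc/<osd pid>/limits".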


------------------ Original ------------------
From: "xl_3992@xxxxxx" <xl_3992@xxxxxx>;
Date: Wed, Dec 1, 2021 06:52 PM
To: "GHui"<ugiwgh@xxxxxx>;
Cc: "ceph-users"<ceph-users@xxxxxxx>;
Subject: Re:  Re: ceph rdma network connect refused

My network conf:

$ sudo cat /etc/ceph/ceph.conf
[global]
fsid = 5a09301c-755d-4258-b79a-0936388a261e
mon_host = [v2:192.168.2.19:3300,v1:192.168.2.19:6789],[v2:192.168.2.20:3300,v1:192.168.2.20:6789],[v2:192.168.2.21:3300,v1:192.168.2.21:6789]
mon_initial_members = MN-001.gz.cn,MN-002.gz.cn,MN-003.gz.cn
public_network = 192.168.2.0/24

I think the OSDs go down once the cluster exceeds 16 nodes because there are too many TCP connections. How many connections can a Ceph OSD handle?

My environment:
60 disks per node;
30 nodes in the cluster (30 × 60 = 1800 OSDs in total).

Given the above, how should I configure the TCP parameters?
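For example, are these the kind of kernel parameters I should be tuning? (Illustrative values only, I have not applied them yet.)

# /etc/sysctl.d/90-ceph-net.conf
fs.file-max = 4194304                        # system-wide open file descriptors
net.core.somaxconn = 4096                    # listen backlog
net.core.netdev_max_backlog = 8192           # per-interface receive backlog
net.ipv4.tcp_max_syn_backlog = 8192          # half-open connection queue
net.ipv4.ip_local_port_range = 10240 65535   # ephemeral ports for outgoing connections
# apply with: sudo sysctl --system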
xl_3992@xxxxxx
 
From: GHui
Date: 2021-12-01 16:39
To: xl_3992@xxxxxx
CC: ceph-users
Subject:  Re: ceph rdma network connect refused
How did you set up the network between the mgr, mon, mds, and osd daemons?
 
 
------------------ Original ------------------
From: "xl_3992@xxxxxx" <xl_3992@xxxxxx>;
Date: Tue, Nov 30, 2021 10:07 AM
To: "GHui"<ugiwgh@xxxxxx>;
Cc: "ceph-users"<ceph-users@xxxxxxx>;
Subject: Re:  Re: ceph rdma network connect refused
 
I have not tried a higher version; I have only tested with 14.2.22.
 
[store@xxxxxxxxxxxxxxxxxxxx ~]$ show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       0       fe80:0000:0000:0000:0e42:a1ff:fead:58b2                 v1      enp94s0f0
mlx5_0  1       1       fe80:0000:0000:0000:0e42:a1ff:fead:58b2                 v2      enp94s0f0
mlx5_1  1       0       fe80:0000:0000:0000:0e42:a1ff:fead:58b3                 v1      enp94s0f1
mlx5_1  1       1       fe80:0000:0000:0000:0e42:a1ff:fead:58b3                 v2      enp94s0f1
mlx5_bond_0     1       0       fe80:0000:0000:0000:0e42:a1ff:fead:4be6                 v1      bond0
mlx5_bond_0     1       1       fe80:0000:0000:0000:0e42:a1ff:fead:4be6                 v2      bond0
mlx5_bond_0     1       2       0000:0000:0000:0000:0000:ffff:0a5e:303c 192.168.10  v1      bond0
mlx5_bond_0     1       3       0000:0000:0000:0000:0000:ffff:0a5e:303c 192.168.10  v2      bond0
 
 
 
xl_3992@xxxxxx
From: GHui
Date: 2021-11-30 09:50
To: xl_3992@xxxxxx
CC: ceph-users
Subject:  Re: ceph rdma network connect refused
Which Ceph version do you use? Or where did you download the container images from?
 
------------------ Original ------------------
From: "xl_3992@xxxxxx" <xl_3992@xxxxxx>;
Date: Mon, Nov 29, 2021 11:27 AM
To: "ceph-users"<ceph-users@xxxxxxx>;
Subject:  ceph rdma network connect refused
 
I am testing an RDMA network with Ceph. When the cluster exceeds 16 nodes, most of the OSDs go down; with fewer than 16 nodes the cluster health is OK.
Can anyone help me?
 
 
Error log output:
 
2021-11-29 10:53:06.884 7f0839fec700 -1 --2- 10.94.48.70:0/559149 >> [v2:10.94.48.66:7045/3543288,v1:10.94.48.66:7047/3543288] conn(0x5585a4b3ec00 0x5585bd816700 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:10.94.48.66:7045/3543288,v1:10.94.48.66:7047/3543288] is using msgr V1 protocol
2021-11-29 10:53:07.264 7f083a7ed700 -1 Infiniband send_msg send returned error 111: (111) Connection refused
2021-11-29 10:53:07.264 7f083a7ed700 -1 Infiniband send_msg send returned error 111: (111) Connection refused
2021-11-29 10:53:07.264 7f083a7ed700 -1 Infiniband send_msg send returned error 111: (111) Connection refused
2021-11-29 10:53:07.264 7f083a7ed700 -1 Infiniband send_msg send returned error 111: (111) Connection refused
2021-11-29 10:53:07.264 7f083a7ed700 -1 Infiniband send_msg send returned error 111: (111) Connection refused
 
Following “Bring Up Ceph RDMA - Developer's Guide”, my cluster conf is:
 
#----------------------- RDMA ---------------------
ms_type = async+rdma
ms_cluster_type = async+rdma
ms_public_type = async+rdma
ms_async_rdma_device_name = mlx5_bond_0
ms_async_rdma_polling_us = 0
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:0a5e:3046
 
[osd]
osd_memory_target = 4294967296
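For reference, the GID I put into ms_async_rdma_local_gid can be cross-checked in sysfs; the GID index 3 here is only an assumption taken from the show_gids output above (the RoCE v2 IPv4 entry on mlx5_bond_0), and it may differ per node:

$ cat /sys/class/infiniband/mlx5_bond_0/ports/1/gids/3
$ cat /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/3    # should report "RoCE v2"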
 
Node environment:
[store@xxxxxxxxxxxxxxxxxxxx ~]$ ulimit
unlimited
[store@xxxxxxxxxxxxxxxxxxxx ~]$ ibdev2netdev
mlx5_0 port 1 ==> enp94s0f0 (Down)
mlx5_1 port 1 ==> enp94s0f1 (Down)
mlx5_bond_0 port 1 ==> bond0 (Up)
 
[store@xxxxxxxxxxxxxxxxxxxx ~]$ sudo cat /usr/lib/systemd/system/ceph-osd@.service
[Unit]
Description=Ceph object storage daemon osd.%i
PartOf=ceph-osd.target
After=network-online.target local-fs.target time-sync.target
Before=remote-fs-pre.target ceph-osd.target
Wants=network-online.target local-fs.target time-sync.target remote-fs-pre.target ceph-osd.target
 
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
EnvironmentFile=-/etc/sysconfig/ceph
Environment=CLUSTER=ceph
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
ExecReload=/bin/kill -HUP $MAINPID
LockPersonality=true
MemoryDenyWriteExecute=true
 
[Install]
WantedBy=ceph-osd.target
 
[store@xxxxxxxxxxxxxxxxxxxx ~]$ cat  /etc/security/limits.conf
root soft nofile 10000000
root hard nofile 10000000
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx