Hello Frank,

Thank you for your help. Ceph is our OpenStack main storage. We have 64 compute nodes (Ceph clients), 36 Ceph hosts (on the client and cluster networks) and 3 MONs: so roughly 140 ARP entries. Our ARP cache size is based on the defaults, so 128/512/1024. As 140 < 512, the defaults should work (I will check the ARP cache size over time, however).

We tried the settings below two weeks ago (we thought they would improve our network), but it was worse!

net.core.rmem_max = 134217728 (for 10 Gbps with low latency)
net.core.wmem_max = 134217728 (for 10 Gbps with low latency)
net.core.netdev_max_backlog = 300000
net.core.somaxconn = 2000
net.ipv4.ip_local_port_range = '10000 65000'
net.ipv4.tcp_rmem = 4096 87380 134217728 (for 10 Gbps with low latency)
net.ipv4.tcp_wmem = 4096 87380 134217728 (for 10 Gbps with low latency)
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_sack = 0
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_fack = 0
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_max_syn_backlog = 30000

Client and cluster networks have a 9000 MTU. Each OSD host has two LACP teams: 2x10 Gbps for the client network and 2x10 Gbps for the cluster network. The client network is a single level-2 LAN, likewise for the cluster network.

As I said, we didn't see significant error counters on the switches or servers.

Vincent

On Fri, 29 Nov 2019 at 09:30, Frank Schilder <frans@xxxxxx> wrote:
>
> How large is your ARP cache? We have seen Ceph dropping connections as soon as the level-2 network (direct neighbours) is larger than the ARP cache. We adjusted the following settings:
>
> # Increase ARP cache size to accommodate large level-2 client network.
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
>
> Another important group of parameters for TCP connections seems to be these, with our values:
>
> ## Increase number of incoming connections.
> The value can be raised for bursts of requests; the default is 128.
> net.core.somaxconn = 2048
> ## Increase the incoming connection backlog; the default is 1000.
> net.core.netdev_max_backlog = 50000
> ## Maximum number of remembered connection requests; the default is 128.
> net.ipv4.tcp_max_syn_backlog = 30000
>
> With this, we got rid of dropped connections in a cluster of 20 Ceph nodes and ca. 550 client nodes, accounting for about 1500 active Ceph clients, 1400 CephFS and 170 RBD images.
>
> Best regards,
>
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Vincent Godin <vince.mlist@xxxxxxxxx>
> Sent: 27 November 2019 20:11:23
> To: Anthony D'Atri; ceph-users@xxxxxxx; Ceph Development
> Subject: Re: mimic 13.2.6 too much broken connexions
>
> If it was a network issue, the counters should explode (as I said, with a log level of 5 on the messenger, we observed more than 80,000 lossy channels per minute), but nothing abnormal shows up in the counters (on switches and servers).
> On the switches: no drops, no CRC errors, no packet loss, only some output discards, but not enough to be significant. On the server NICs, via ethtool -S, nothing is relevant.
> And as I said, another Mimic cluster on different hardware shows the same behavior.
> Ceph uses connection pools from host to host, but how does it check the availability of these connections over time?
> And since the network doesn't seem to be guilty, what can explain these broken channels?
>
> On Wed, 27 Nov 2019 at 19:05, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> >
> > Are you bonding NIC ports? If so, do you have the correct hash policy defined? Have you looked at the *switch* side for packet loss, CRC errors, etc.? What you report could be consistent with this. Since the host interface for a given connection will vary by the bond hash, some OSD connections will use one port and some the other.
> > So if one port has switch-side errors, or is blackholed on the switch, you could see some heartbeating impacted but not others.
> >
> > Also make sure you have the optimal reporters value.
> >
> > > On Nov 27, 2019, at 7:31 AM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
> > >
> > > Since I submitted the mail below a few days ago, we have found some clues.
> > > We observed a lot of lossy connections, like:
> > > ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700 0 -- 192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518 conn(0x563979a9f600 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1)
> > > We raised the log level of the messenger to 5/5 and observed, for the whole cluster, more than 80,000 lossy connections per minute!!!
> > > We adjusted "ms_tcp_read_timeout" from 900 to 60 seconds; after that, no more lossy connections in the logs and no more failed health checks.
> > > It's just a workaround, but there is a real problem with these broken sessions, and it leads to two points:
> > > - Ceph takes too much time to detect broken sessions and should recycle them quicker!
> > > - What are the reasons for these broken sessions?
> > >
> > > We have another Mimic cluster on different hardware and observed the same behavior: lots of lossy sessions, slow ops and co.
> > > The symptoms are the same:
> > > - some OSDs on one host get no response from another OSD on a different host
> > > - after some time, slow ops are detected
> > > - sometimes it leads to blocked I/O
> > > - after about 15 min the problem vanishes
> > >
> > > -----------
> > >
> > > Help on diag needed: heartbeat_failed
> > >
> > > We encounter a strange behavior on our Mimic 13.2.6 cluster. At any time, and without any load, some OSDs become unreachable from only some hosts. It lasts 10 min and then the problem vanishes. It's not always the same OSDs and the same hosts.
> > > There is no network failure on any of the hosts (because only some OSDs become unreachable), nor a disk freeze, as we can see in our Grafana dashboard. The log messages are:
> > > first msg:
> > > 2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481 heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back 2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff 2019-11-24 09:19:23.293436)
> > > last msg:
> > > 2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481 heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back 2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff 2019-11-24 09:30:13.736517)
> > > During this time, 3 hosts were involved: host-18, host-20 and host-30:
> > > host-30 is the only one that can't see OSDs 346, 356 and 352 on host-18
> > > host-30 is the only one that can't see OSDs 387 and 394 on host-20
> > > host-18 is the only one that can't see OSDs 583, 585, 591 and 597 on host-30
> > > We can't see any strange behavior on hosts 18, 20 and 30 in our node exporter data during this time.
> > > Any ideas or advice?
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
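[Editor's note] The ARP-cache check Frank describes at the top of the thread can be turned into a quick per-node diagnostic. This is a minimal sketch, assuming a Linux host with /proc mounted; the gc_thresh names are the kernel sysctls from Frank's mail, while the warning logic itself is illustrative:

```shell
#!/bin/sh
# Compare the current IPv4 neighbour (ARP) table size against the kernel's
# garbage-collection thresholds. Above gc_thresh2 the kernel starts GCing
# aggressively; at gc_thresh3 new neighbour entries are dropped, which can
# surface as lost OSD heartbeats on a large flat level-2 network.

# /proc/net/arp has one header line; subtract it to get the entry count.
entries=$(($(wc -l < /proc/net/arp) - 1))

thresh2=$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh2)
thresh3=$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh3)

echo "ARP entries: $entries (gc_thresh2=$thresh2, gc_thresh3=$thresh3)"

if [ "$entries" -ge "$thresh2" ]; then
    echo "WARNING: neighbour table under GC pressure; consider raising gc_thresh*"
fi
```

Run it on each OSD host and MON; if the entry count sits near gc_thresh2, raising the three gc_thresh sysctls (as in Frank's values) is the usual remedy.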