Re: mimic 13.2.6 too much broken connexions

Frank Schilder <frans@xxxxxx> · Fri, 29 Nov 2019 08:30:13 +0000

How large is your ARP cache? We have seen Ceph dropping connections as soon as the number of hosts on the layer-2 network (direct neighbours) exceeds the size of the ARP cache. We adjusted the following settings:

# Increase ARP cache size to accommodate a large layer-2 client network.
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
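
A minimal sketch for checking whether a host is close to these limits, assuming a Linux host that exposes /proc/net/arp and the gc_thresh files under /proc/sys. The function names are made up for illustration, and /proc/net/arp lists IPv4 entries only, so the count is an approximation of what the kernel tracks:

```python
from pathlib import Path

# Standard Linux location of the IPv4 neighbour table thresholds.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def count_arp_entries(arp_table: str) -> int:
    """Count entries in text formatted like /proc/net/arp (first line is a header)."""
    lines = [line for line in arp_table.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def arp_pressure(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Classify the table fill level against the three gc thresholds."""
    if entries >= thresh3:
        return "full: at or above the hard limit gc_thresh3"
    if entries >= thresh2:
        return "soft limit reached: at or above gc_thresh2"
    if entries >= thresh1:
        return "gc may run: at or above gc_thresh1"
    return "ok: below gc_thresh1"


def read_thresholds(base: Path = NEIGH_DIR) -> tuple:
    """Read gc_thresh1..3 from /proc/sys (Linux only)."""
    return tuple(int((base / f"gc_thresh{i}").read_text()) for i in (1, 2, 3))


def report() -> str:
    """Summarise the current state of this host; call it on a Linux machine."""
    entries = count_arp_entries(Path("/proc/net/arp").read_text())
    return f"{entries} entries, {arp_pressure(entries, *read_thresholds())}"
```

Running print(report()) on an OSD host and on a busy client gives a quick idea of how far the neighbour table is from gc_thresh3.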

Another important group of parameters for TCP connections seems to be the following, shown with our values:

## Maximum length of a listening socket's accept queue. Can be raised to absorb bursts of connection requests; default is 128
net.core.somaxconn = 2048
## Maximum number of packets queued on the receive side when an interface delivers them faster than the kernel can process them; default is 1000
net.core.netdev_max_backlog = 50000
## Maximum number of remembered connection requests (not yet acknowledged by the client); default is 128 on low-memory machines and scales up with memory
net.ipv4.tcp_max_syn_backlog = 30000
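
To judge whether the accept-queue settings matter on a given host, one option is to watch the ListenOverflows and ListenDrops counters that Linux exposes in /proc/net/netstat. The sketch below only parses that file format; the function names are hypothetical, and it says nothing about netdev_max_backlog, whose drops are reported elsewhere:

```python
def parse_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {"TcpExt": {"ListenDrops": 3, ...}, ...}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # The file alternates a header line of counter names with a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        section = names[0].rstrip(":")
        stats[section] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def listen_queue_drops(text: str) -> dict:
    """Return the two counters that grow when a listen queue overflows."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return {key: tcp_ext.get(key, 0) for key in ("ListenOverflows", "ListenDrops")}
```

On a Linux host, listen_queue_drops(open("/proc/net/netstat").read()) can be sampled twice a few minutes apart; counters that keep growing suggest that incoming connections are being dropped at the listen queue.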

With this, we got rid of dropped connections in a cluster of 20 Ceph nodes and ca. 550 client nodes, accounting for about 1500 active Ceph clients: 1400 CephFS and 170 RBD images.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Vincent Godin <vince.mlist@xxxxxxxxx>
Sent: 27 November 2019 20:11:23
To: Anthony D'Atri; ceph-users@xxxxxxx; Ceph Development
Subject:  Re: mimic 13.2.6 too much broken connexions

If it were a network issue, the counters should explode (as I said,
with a log level of 5 on the messenger, we observed more than 80 000
lossy channels per minute), but nothing abnormal shows up in the
counters (on switches and servers).
On the switches: no drops, no CRC errors, no packet loss, only some
output discards, but not enough to be significant. On the server NICs,
ethtool -S shows nothing relevant.
And as I said, another Mimic cluster with different hardware shows the
same behavior.
Ceph uses connection pools from host to host, but how does it check the
availability of these connections over time?
And since the network does not seem to be at fault, what can explain these
broken channels?
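
For reference, a rate like the one above can be derived from the OSD logs with a small script. This is only a sketch, assuming the messenger log level is 5 and that each matching message sits on a single line in the format quoted further down in this thread; the function name is made up:

```python
import re
from collections import Counter

# Text of the messenger message quoted in this thread.
MARKER = "accept replacing existing (lossy) channel"
# Timestamp truncated to the minute, e.g. "2019-11-27 11:03".
TIMESTAMP_MINUTE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def lossy_per_minute(log_lines) -> Counter:
    """Count lossy-channel replacement messages per minute."""
    counts = Counter()
    for line in log_lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP_MINUTE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

Feeding it the lines of all ceph-osd logs, with or without a grep file-name prefix, gives one counter entry per minute for the whole cluster.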

On Wed, 27 Nov 2019 at 19:05, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> Are you bonding NIC ports? If so, do you have the correct hash policy defined? Have you looked at the *switch* side for packet loss, CRC errors, etc.? What you report could be consistent with this. Since the host interface for a given connection will vary by the bond hash, some OSD connections will use one port and some the other. So if one port has switch-side errors, or is blackholed on the switch, you could see some heartbeating impacted but not others.
>
> Also make sure you have the optimal reporters value.
>
> > On Nov 27, 2019, at 7:31 AM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
> >
> > Since I submitted the mail below a few days ago, we have found some clues.
> > We observed a lot of lossy connections like:
> > ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700  0 --
> > 192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518
> > conn(0x563979a9f600 :6818   s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy)
> > channel (new one lossy=1)
> > We raised the messenger log level to 5/5 and observed, for the whole
> > cluster, more than 80 000 lossy connections per minute!
> > We lowered "ms_tcp_read_timeout" from 900 to 60 seconds; since then there
> > are no more lossy connections in the logs and no failed health checks.
> > It's just a workaround, but there is a real problem with these broken
> > sessions, and it leads to two
> > observations:
> > - Ceph takes too much time to detect a broken session and should recycle it quicker!
> > - What causes these broken sessions?
> >
> > We have another Mimic cluster on different hardware and observed the
> > same behavior: lots of lossy sessions, slow ops, etc.
> > The symptoms are the same:
> > - some OSDs on one host get no response from another OSD on a different host
> > - after some time, slow ops are detected
> > - sometimes this leads to blocked I/O
> > - after about 15 min the problem vanishes
> > -----------
> >
> > Help on diag needed : heartbeat_failed
> >
> > We are seeing strange behavior on our Mimic 13.2.6 cluster. At any
> > time, and without any load, some OSDs become unreachable from only
> > some hosts. It lasts 10 min and then the problem vanishes.
> > It's not always the same OSDs or the same hosts. There is no network
> > failure on any of the hosts (because only some OSDs become unreachable)
> > nor any disk freeze, as far as we can see in our Grafana dashboard. The
> > log messages are:
> > first msg:
> > 2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
> > heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
> > 2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
> > 2019-11-24 09:19:23.293436)
> > last msg:
> > 2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
> > heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
> > 2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
> > 2019-11-24 09:30:13.736517)
> > During this time, 3 hosts were involved: host-18, host-20 and host-30:
> > host-30 is the only one that can't see OSDs 346, 356 and 352 on host-18
> > host-30 is the only one that can't see OSDs 387 and 394 on host-20
> > host-18 is the only one that can't see OSDs 583, 585, 591 and 597 on host-30
> > We can't see any strange behavior on hosts 18, 20 and 30 in our node
> > exporter data during this time.
> > Any ideas or advice?
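
A per-host view like the one in the original report can be built by grouping the heartbeat_check messages by reporting OSD and target OSD. The following is a rough sketch, assuming each message is on one line in the format quoted above; the function name is hypothetical, and mapping OSD ids to hosts is left out:

```python
import re
from collections import Counter

# Matches e.g. "osd.596 146481 heartbeat_check: no reply from 192.168.6.112:6817 osd.394"
HEARTBEAT = re.compile(
    r"(osd\.\d+) \d+ heartbeat_check: no reply from (\S+) (osd\.\d+)"
)


def failed_pairs(log_lines) -> Counter:
    """Count heartbeat failures per (reporting OSD, unreachable OSD) pair."""
    pairs = Counter()
    for line in log_lines:
        match = HEARTBEAT.search(line)
        if match:
            reporter, _address, target = match.groups()
            pairs[(reporter, target)] += 1
    return pairs
```

Sorting the result with most_common() shows which OSD pairs fail most often, which makes it easier to spot whether failures cluster on particular host pairs.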
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx