How large is your ARP cache? We have seen Ceph dropping connections as soon as the layer-2 network (direct neighbours) is larger than the ARP cache. We adjusted the following settings:

# Increase ARP cache size to accommodate large layer-2 client network.
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

Another important group of parameters for TCP connections seems to be these, with our values:

## Increase number of incoming connections. The value can be raised to absorb bursts of requests, default is 128
net.core.somaxconn = 2048
## Increase number of incoming connections backlog, default is 1000
net.core.netdev_max_backlog = 50000
## Maximum number of remembered connection requests, default is 128
net.ipv4.tcp_max_syn_backlog = 30000

With this, we got rid of dropped connections in a cluster of 20 Ceph nodes and ca. 550 client nodes, accounting for about 1500 active Ceph clients, 1400 CephFS and 170 RBD images.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Vincent Godin <vince.mlist@xxxxxxxxx>
Sent: 27 November 2019 20:11:23
To: Anthony D'Atri; ceph-users@xxxxxxx; Ceph Development
Subject: Re: mimic 13.2.6 too much broken connexions

If it were a network issue, the counters should explode (as I said, with a log level of 5 on the messenger, we observed more than 80 000 lossy channels per minute), but nothing abnormal shows up in the counters (on switches and servers).
On the switches: no drops, no CRC errors, no packet loss, only some output discards, but not enough to be significant.
On the server NICs, ethtool -S shows nothing relevant.
And as I said, another Mimic cluster with different hardware shows the same behavior.
Ceph uses connection pools from host to host, but how does it check the availability of these connections over time?
And as the network doesn't seem to be at fault, what can explain these broken channels?

On Wed, 27 Nov 2019 at 19:05, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> Are you bonding NIC ports? If so, do you have the correct hash policy defined? Have you looked at the *switch* side for packet loss, CRC errors, etc.? What you report could be consistent with this. Since the host interface for a given connection will vary by the bond hash, some OSD connections will use one port and some the other. So if one port has switch-side errors, or is blackholed on the switch, you could see some heartbeating impacted but not others.
>
> Also make sure you have the optimal reporters value.
>
> > On Nov 27, 2019, at 7:31 AM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
> >
> > Since I sent the mail below a few days ago, we have found some clues.
> > We observed a lot of lossy connections like:
> > ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700 0 --
> > 192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518
> > conn(0x563979a9f600 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy)
> > channel (new one lossy=1)
> > We raised the messenger log level to 5/5 and observed more than
> > 80 000 lossy connections per minute across the whole cluster!
> > We lowered "ms_tcp_read_timeout" from 900 to 60 seconds; since then
> > there have been no more lossy connections in the logs and no failed
> > health checks.
> > It's just a workaround, but there is a real problem with these broken
> > sessions, which raises two points:
> > - Ceph takes too long to detect a broken session and should recycle it more quickly!
> > - What causes these broken sessions?
> >
> > We have another Mimic cluster on different hardware and observed the
> > same behavior: lots of lossy sessions, slow ops and so on.
> > Symptoms are the same:
> > - some OSDs on one host get no response from another OSD on a different host
> > - after some time, slow ops are detected
> > - sometimes it leads to blocked I/O
> > - after about 15 minutes the problem vanishes
> >
> > -----------
> >
> > Help on diag needed : heartbeat_failed
> >
> > We encounter a strange behavior on our Mimic 13.2.6 cluster. At any
> > time, and without any load, some OSDs become unreachable from only
> > some hosts. It lasts about 10 minutes and then the problem vanishes.
> > It's not always the same OSDs and the same hosts. There is no network
> > failure on any of the hosts (only some OSDs become unreachable),
> > nor any disk freeze, as far as we can see in our Grafana dashboard.
> > Log messages are:
> > first msg:
> > 2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
> > heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
> > 2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
> > 2019-11-24 09:19:23.293436)
> > last msg:
> > 2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
> > heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
> > 2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
> > 2019-11-24 09:30:13.736517)
> > During this time, 3 hosts were involved: host-18, host-20 and host-30:
> > host-30 is the only one that can't see OSDs 346, 356 and 352 on host-18
> > host-30 is the only one that can't see OSDs 387 and 394 on host-20
> > host-18 is the only one that can't see OSDs 583, 585, 591 and 597 on host-30
> > We can't see any strange behavior on hosts 18, 20 and 30 in our node
> > exporter data during this time.
> > Any ideas or advice?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
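
A rough way to check the ARP cache question raised at the top of this thread is to compare the number of neighbour entries on a host with the three gc_thresh limits. The sketch below is only an illustration under stated assumptions: it runs on Linux, it reads /proc/net/arp (which lists IPv4 entries for the current network namespace only), and the helper names count_arp_entries and classify are made up for this example and are not part of Ceph or any library.

```python
from pathlib import Path

# Standard Linux procfs locations; assumed to exist only on Linux hosts.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def count_arp_entries(arp_text: str) -> int:
    """Count entries in /proc/net/arp-style text (the first line is a header)."""
    rows = [line for line in arp_text.splitlines() if line.strip()]
    return max(len(rows) - 1, 0)


def classify(entries: int, thresh1: int, thresh2: int, thresh3: int) -> str:
    """Place the entry count relative to the three gc_thresh limits (hypothetical labels)."""
    if entries >= thresh3:
        return "hard-limit"  # hard maximum reached: the table is full
    if entries > thresh2:
        return "soft-limit"  # above the soft maximum: garbage collection gets aggressive
    if entries > thresh1:
        return "gc-active"   # above the minimum: the garbage collector may evict entries
    return "ok"              # below gc_thresh1: the garbage collector does not run


if __name__ == "__main__":
    if ARP_TABLE.exists() and NEIGH_DIR.is_dir():
        entries = count_arp_entries(ARP_TABLE.read_text())
        t1, t2, t3 = (
            int((NEIGH_DIR / f"gc_thresh{i}").read_text()) for i in (1, 2, 3)
        )
        print(f"{entries} IPv4 neighbour entries, thresholds {t1}/{t2}/{t3}: "
              f"{classify(entries, t1, t2, t3)}")
    else:
        print("No Linux /proc neighbour information found on this host.")
```

Run on a client or OSD host during normal operation: a result of "soft-limit" or "hard-limit" would suggest the thresholds are too small for the number of direct neighbours, which is the situation described in the first reply.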