Ceph cluster running Jewel 10.2.11.
Mons and hosts are on CentOS 7.5.1804, kernel 3.10.0-862.6.3.el7.x86_64.

Every day, we see many entries like these in ceph.log on the monitor:

2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 : cluster [WRN] map e612590 wrongly marked me down
2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 : cluster [WRN] map e612588 wrongly marked me down
2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 : cluster [WRN] map e612591 wrongly marked me down
2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 : cluster [WRN] map e612599 wrongly marked me down
2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 : cluster [WRN] map e612597 wrongly marked me down
2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 : cluster [WRN] map e612608 wrongly marked me down
2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 : cluster [WRN] map e612604 wrongly marked me down
2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 : cluster [WRN] map e612607 wrongly marked me down
2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 : cluster [WRN] map e612609 wrongly marked me down

On the server 192.168.1.228, for example, /var/log/messages looks like this:

Oct 2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
Oct 2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645 7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905 7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
Oct 2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from 192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484 7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154 7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from 192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front 2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580 7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
Oct 2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)

There was no network problem at that time (I checked the logs on the host and on the switch).
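
In the heartbeat_check lines above, the "front" timestamp is older than the cutoff while the "back" timestamp is only a few seconds old, and the cutoff sits about 20 seconds before the log time. To make that easier to see across a whole /var/log/messages file, here is a rough Python sketch that extracts those fields and computes the lags. The function names and the output fields are my own invention for illustration, not anything shipped with Ceph, and the regular expression only assumes the Jewel line layout shown in this post. As far as I understand, "front" and "back" are the heartbeat connections on the public and cluster networks respectively, but please treat that as an assumption.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the heartbeat_check lines of this post.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Assumption: the Jewel log layout quoted above; other releases may differ.
LINE_RE = re.compile(
    rf"(?P<now>{TS}) \S+ +-1 (?P<osd>osd\.\d+) \d+ heartbeat_check: "
    rf"no reply from (?P<addr>\S+) (?P<peer>osd\.\d+) "
    rf"since back (?P<back>{TS}) front (?P<front>{TS}) "
    rf"\(cutoff (?P<cutoff>{TS})\)"
)

# One line copied from the /var/log/messages excerpt above.
SAMPLE = (
    "Oct 2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658 "
    "7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from "
    "192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 "
    "front 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)"
)


def parse_ts(text):
    """Convert a 'YYYY-MM-DD HH:MM:SS.ffffff' string to a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_heartbeat_line(line):
    """Return a dict of lags for one heartbeat_check line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    now, back, front, cutoff = (
        parse_ts(m[key]) for key in ("now", "back", "front", "cutoff")
    )
    return {
        "osd": m["osd"],        # OSD that logged the complaint
        "peer": m["peer"],      # OSD that did not reply
        "addr": m["addr"],      # address of the silent peer
        "back_lag": (now - back).total_seconds(),
        "front_lag": (now - front).total_seconds(),
        "grace": (now - cutoff).total_seconds(),
        "back_stale": back < cutoff,    # last "back" reply older than cutoff
        "front_stale": front < cutoff,  # last "front" reply older than cutoff
    }


if __name__ == "__main__":
    # Print the parsed fields for the sample line, one per row.
    for key, value in parse_heartbeat_line(SAMPLE).items():
        print(f"{key}: {value}")
```

Feeding every line of the messages file through parse_heartbeat_line and counting how often front_stale and back_stale are set per peer address should show whether the missed replies are confined to one of the two heartbeat channels or to specific peer hosts.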
The OSD logs show nothing but "wrongly marked me down" and session resets caused by this monitor action.

As several OSDs are impacted, it looks like a host problem.

The sysctl.conf is:

net.core.rmem_max=56623104
net.core.wmem_max=56623104
net.core.rmem_default=56623104
net.core.wmem_default=56623104
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 56623104
net.ipv4.tcp_wmem=4096 65536 56623104
net.core.somaxconn=1024
net.core.netdev_max_backlog=50000
net.ipv4.tcp_max_syn_backlog=30000
net.ipv4.tcp_max_tw_buckets=2000000
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.udp_rmem_min=8192
net.ipv4.udp_wmem_min=8192
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.accept_source_route=0
kernel.pid_max=4194303
fs.file-max=26234859

Does anyone have an idea of the cause, or has anyone already run into this behaviour?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com