Strange Ceph host behaviour

Vincent Godin <vince.mlist@xxxxxxxxx> · Tue, 2 Oct 2018 17:17:52 +0200

Ceph cluster in Jewel 10.2.11
Mons & Hosts are on CentOS 7.5.1804 kernel 3.10.0-862.6.3.el7.x86_64

Every day, ceph.log on the monitor shows many entries like these:

2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
cluster [WRN] map e612590 wrongly marked me down
2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 :
cluster [WRN] map e612588 wrongly marked me down
2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 :
cluster [WRN] map e612591 wrongly marked me down
2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
cluster [WRN] map e612599 wrongly marked me down
2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 :
cluster [WRN] map e612597 wrongly marked me down
2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
cluster [WRN] map e612608 wrongly marked me down
2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
cluster [WRN] map e612604 wrongly marked me down
2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 :
cluster [WRN] map e612607 wrongly marked me down
2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 :
cluster [WRN] map e612609 wrongly marked me down
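
To see whether these warnings cluster on particular hosts, the entries can be tallied per source address. The following is only a rough sketch: it assumes each entry sits on a single line in ceph.log (the lines above are wrapped by the mail client), and the function name count_by_host and the SAMPLE list are made up for illustration.

```python
import re
from collections import Counter

# One unwrapped ceph.log entry: "<date> <time> osd.N <ip>:<port>/<nonce> <seq> : cluster [WRN] ..."
LINE_RE = re.compile(
    r"^\S+ \S+ osd\.(?P<osd>\d+) (?P<ip>[\d.]+):\d+/\d+ \d+ : "
    r"cluster \[WRN\] map e(?P<epoch>\d+) wrongly marked me down"
)


def count_by_host(lines):
    """Return a Counter mapping host IP to its number of 'wrongly marked me down' entries."""
    hosts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            hosts[m.group("ip")] += 1
    return hosts


# Three entries taken from the excerpt above, joined back onto one line each.
SAMPLE = [
    "2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 : "
    "cluster [WRN] map e612590 wrongly marked me down",
    "2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 : "
    "cluster [WRN] map e612599 wrongly marked me down",
    "2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 : "
    "cluster [WRN] map e612597 wrongly marked me down",
]

# Print hosts from most to least affected.
for ip, n in count_by_host(SAMPLE).most_common():
    print(ip, n)
```

For a real run, the list would be replaced by the lines of the monitor's ceph.log file.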

On the server 192.168.1.228, for example, /var/log/messages looks like this:

Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
Oct  2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645
7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905
7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484
7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580
7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)
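
Each heartbeat_check entry carries three timestamps: the last reply on "back", the last reply on "front", and the cutoff. In every entry of the excerpt above the "front" timestamp is older than the cutoff while the "back" timestamp is recent. As far as I understand it, "front" is the heartbeat on the public network and "back" the one on the cluster network, but that is an assumption worth checking against the Ceph documentation. The sketch below shows one way to extract this from the log; it assumes unwrapped lines, and the names stale_channels and SAMPLE are made up for illustration.

```python
import re
from datetime import datetime

# Matches the heartbeat_check part of an unwrapped /var/log/messages line.
HB_RE = re.compile(
    r"osd\.(?P<osd>\d+) \d+ heartbeat_check: no reply from "
    r"(?P<addr>\S+) osd\.(?P<peer>\d+) "
    r"since back (?P<back>\S+ \S+) front (?P<front>\S+ \S+) "
    r"\(cutoff (?P<cutoff>\S+ \S+)\)"
)
TS_FMT = "%Y-%m-%d %H:%M:%S.%f"


def stale_channels(line):
    """Return reporting OSD, peer OSD and which of 'back'/'front' is older than the cutoff."""
    m = HB_RE.search(line)
    if m is None:
        return None
    cutoff = datetime.strptime(m.group("cutoff"), TS_FMT)
    stale = [
        name
        for name in ("back", "front")
        if datetime.strptime(m.group(name), TS_FMT) < cutoff
    ]
    return {"osd": int(m.group("osd")), "peer": int(m.group("peer")), "stale": stale}


# First entry of the excerpt above, joined back onto one line.
SAMPLE = (
    "Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658 "
    "7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from "
    "192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front "
    "2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)"
)

print(stale_channels(SAMPLE))
```

Running this over the whole file and grouping by peer and by stale channel should show whether the missed replies are confined to one network or one set of peers.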

There was no network problem at that time (I checked the logs on the
host and on the switch). The OSD logs show nothing but "wrongly marked me
down" and session resets caused by this monitor action. As several OSDs
are affected, it looks like a host problem.

The sysctl.conf is:

net.core.rmem_max=56623104
net.core.wmem_max=56623104
net.core.rmem_default=56623104
net.core.wmem_default=56623104
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 56623104
net.ipv4.tcp_wmem=4096 65536 56623104
net.core.somaxconn=1024
net.core.netdev_max_backlog=50000
net.ipv4.tcp_max_syn_backlog=30000
net.ipv4.tcp_max_tw_buckets=2000000
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.udp_rmem_min=8192
net.ipv4.udp_wmem_min=8192
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.accept_source_route=0

kernel.pid_max=4194303
fs.file-max=26234859
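
Since values in sysctl.conf only apply once they have been loaded (at boot or with sysctl -p), it may be worth confirming that the running kernel actually uses them. The sketch below is one possible way to do that under Linux by comparing the file with /proc/sys; the helper names parse_sysctl, proc_path and mismatches are made up for illustration, and SAMPLE holds only a few of the settings listed above.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key=value' lines into a dict, skipping blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse whitespace so multi-value entries compare cleanly.
            settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl key such as net.core.somaxconn to its file under /proc/sys."""
    return Path("/proc/sys") / key.replace(".", "/")


def mismatches(settings):
    """Return {key: (wanted, live)} for keys whose live value differs or cannot be read."""
    diffs = {}
    for key, wanted in settings.items():
        try:
            live = " ".join(proc_path(key).read_text().split())
        except OSError:
            live = None  # key not present on this kernel, or not readable
        if live != wanted:
            diffs[key] = (wanted, live)
    return diffs


# A few of the settings listed above.
SAMPLE = """\
net.core.rmem_max=56623104
net.ipv4.tcp_rmem=4096 87380 56623104
net.core.somaxconn=1024
"""

print(parse_sysctl(SAMPLE))
```

Calling mismatches(parse_sysctl(...)) on the contents of the real file, on each OSD host, would list any setting that is not in effect.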

Does anyone have an idea, or has anyone already seen this behaviour?