Continuous error: "libceph: monX session lost, hunting for new mon" on one host

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello

I have a CEPH cluster with 3 nodes, each with 3 OSDs, running Proxmox, CEPH  versions:

{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
    }
}

CEPH has public and cluster network on 10.10.10.0/24, the three nodes are 10.10.10.251, 10.10.10.252, 10.10.10.253 and networking is working good (I kept ping from one of the nodes to the others two running for hours and had 0 packet loss)

On one node with ip 10.10.10.252 I get strange message in dmesg

kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established                                                                
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established                                                                
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established                                                                
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established                                                                
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established

and that is going on all the day.

In ceph -w I get

2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0

pve-hs-main is the host with ip 10.10.10.251

Actually CEPH storage is very low on usage, on average 200 kB/s read or write (as shown with ceph -s) so I don't think it's a problem about load average of the cluster.

The strange is that I see mon1 10.10.10.252:6789 session lost and that's from log of node 10.10.10.252 so it's losing connection with the monitor on the same node, I don't think it's network related.

I already tried with nodes reboot, ceph-mon and ceph-mgr restart, but the problem is still there.

Any ideas?

Thanks




--
Marco Baldini
H.S. Amiata Srl
Ufficio:   0577-779396
Cellulare:   335-8765169
WEB:   www.hsamiata.it
EMAIL:   mbaldini@xxxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux