Continuous error: "libceph: monX session lost, hunting for new mon" on one host

"Marco Baldini - H.S. Amiata" <mbaldini@xxxxxxxxxxx> · Mon, 23 Oct 2017 15:35:21 +0200



    Hello,
    I have a Ceph cluster with 3 nodes, each with 3 OSDs, running
      Proxmox. Ceph versions:
{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
    }
}


    Ceph has its public and cluster networks on 10.10.10.0/24. The three
      nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and networking
      is working well (I kept a ping running from one node to the other
      two for hours and had 0 packet loss).
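
    A ping only exercises ICMP, so a TCP connect to the monitor port
      would be a useful complement to the test above. What follows is a
      minimal sketch, not something that was run on this cluster: the
      addresses and port 6789 are the ones quoted in this post, while the
      helper name can_connect and the 2 second timeout are assumptions
      made for illustration.

```python
import socket

# Monitor addresses from this post; 6789 is the mon port shown in the dmesg lines.
MONITORS = [
    ("10.10.10.251", 6789),
    ("10.10.10.252", 6789),
    ("10.10.10.253", 6789),
]


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves the port accepts connections,
    not that the monitor session stays healthy afterwards.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    for host, port in MONITORS:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```

    Running it in a loop from the affected node would show whether the
      TCP port itself ever becomes unreachable; it says nothing about why
      an established session is dropped.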

    
    On the node with IP 10.10.10.252 I get strange messages in dmesg:
kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established                                                                
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established                                                                
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established                                                                
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established                                                                
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established


    and this goes on all day.
    In ceph -w I get:
2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0


    pve-hs-main is the host with IP 10.10.10.251.

    
    Ceph storage usage is currently very low, on average 200 kB/s
      read or write (as shown by ceph -s), so I don't think the
      problem is caused by load on the cluster.
    The strange thing is that I see mon1 10.10.10.252:6789 session lost
      in the log of node 10.10.10.252 itself, so it is losing the
      connection to the monitor on the same node. That is why I don't
      think it's network related.
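
    To make the pattern in the dmesg output easier to see, the session
      lifetimes can be pulled out of the log with a small script. This is
      only a sketch under assumptions: it expects the relative-timestamp
      format pasted above, and it treats the "+N" delta on a "session
      lost" line as the lifetime of the session established on the line
      just before it. The function name session_lifetimes and the SAMPLE
      constant are hypothetical.

```python
import re

# Matches lines such as:
# kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
EVENT = re.compile(
    r"\[\s*(?P<stamp>[^\]]+)\]\s+libceph: (?P<mon>mon\d+) (?P<addr>\S+) "
    r"session (?P<event>established|lost)"
)

# Three lines copied from the dmesg excerpt in this post.
SAMPLE = """\
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
"""


def session_lifetimes(dmesg_text):
    """Return (mon, addr, seconds) for each established -> lost pair.

    Only pairs on adjacent lines are counted, and only when the 'lost'
    line carries a relative '+seconds' stamp. Lines stamped with a wall
    clock marker such as [Oct23 14:43] are skipped because they carry
    no delta.
    """
    lifetimes = []
    previous = None
    for line in dmesg_text.splitlines():
        match = EVENT.search(line)
        if match is None:
            previous = None
            continue
        stamp = match["stamp"].strip()
        if (
            match["event"] == "lost"
            and stamp.startswith("+")
            and previous is not None
            and previous["event"] == "established"
            and previous["mon"] == match["mon"]
        ):
            lifetimes.append((match["mon"], match["addr"], float(stamp[1:])))
        previous = match
    return lifetimes


if __name__ == "__main__":
    for mon, addr, seconds in session_lifetimes(SAMPLE):
        print(f"{mon} {addr} session lasted {seconds:.3f} s")
```

    In the excerpt above, every session that can be measured this way
      lasts about 30.72 s, whichever monitor it was established with
      (mon0, mon1 or mon2). That regularity looks more like a fixed timer
      than random packet loss, but this is only an inference from the log.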
    I have already tried rebooting the nodes and restarting ceph-mon and
      ceph-mgr, but the problem is still there.
    Any ideas?

    
    Thanks

    
    -- 
    Marco Baldini
    H.S. Amiata Srl
    Office: 0577-779396
    Mobile: 335-8765169
    Web: www.hsamiata.it
    Email: mbaldini@xxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com