Does this mean that the nodes have the public and cluster networks
defined separately but both on 10.10.10.0/24, or that you did not
specify a separate cluster network at all?
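For comparison, a setup with a genuinely separate cluster network would carry two distinct subnets in ceph.conf, along these lines (the addresses here are placeholders, not taken from the poster's configuration):

```ini
[global]
    # Client-facing traffic: monitors, client I/O
    public network  = 10.10.10.0/24
    # OSD replication and heartbeat traffic on its own subnet
    cluster network = 10.10.20.0/24
```

With both options pointing at the same subnet (or with `cluster network` omitted), all traffic shares one network, which is what the question above is trying to establish.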
On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:
Hello
I have a Ceph cluster with 3 nodes, each with 3 OSDs, running on
Proxmox. Ceph versions:
{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
    }
}
Ceph has both the public and cluster network on 10.10.10.0/24; the
three nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and
networking is working well (I kept a ping running for hours from one
node to the other two and had 0 packet loss).
On the node with IP 10.10.10.252 I get strange messages in dmesg:
kern :info : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern :info : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern :info : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern :info : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern :info : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern :info : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern :info : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern :info : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern :info : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000524] libceph: mon1 10.10.10.252:6789 session established
and this has been going on all day.
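The pattern in the log is strikingly regular: every session is dropped roughly 30.7 seconds after the previous event and immediately re-established, which looks more like a periodic timeout than random packet loss. A quick sketch to pull those intervals out of the dmesg delta timestamps (the sample lines and parsing regex are just illustrative, not a Ceph tool):

```python
import re

# A few of the dmesg lines quoted above, copied here for illustration.
dmesg = """\
[ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
[ +0.000749] libceph: mon2 10.10.10.253:6789 session established
[ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
[ +0.000730] libceph: mon0 10.10.10.251:6789 session established
"""

# Extract the relative timestamp of each "session lost" event.
lost_deltas = [
    float(m.group(1))
    for m in re.finditer(r"\[ \+([0-9.]+)\] libceph: mon\d .* session lost", dmesg)
]

# Each session dies ~30.7 s after the preceding event.
print(lost_deltas)
```

The near-constant ~30.7 s spacing is the detail worth chasing: it suggests the kernel client's monitor session is expiring on a fixed schedule rather than being cut by an unreliable link.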
In ceph -w I see:
2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
pve-hs-main is the host with IP 10.10.10.251.
The Ceph storage is actually under very light load, on average
200 kB/s read or write (as shown by ceph -s), so I don't think the
problem is the load on the cluster.
The strange thing is that I see mon1 10.10.10.252:6789 session lost
in the log of node 10.10.10.252 itself, so the node is losing the
connection to the monitor running on the same host; that is why I
don't think it's network related.
I already tried rebooting the nodes and restarting ceph-mon and
ceph-mgr, but the problem is still there.
Any ideas?
Thanks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com