Re: Continuous error: "libceph: monX session lost, hunting for new mon" on one host

Thanks for the reply.

My ceph.conf:

[global]
         auth client required = none
         auth cluster required = none
         auth service required = none
         bluestore_block_db_size = 64424509440
         cluster network = 10.10.10.0/24
         fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
         keyring = /etc/pve/priv/$cluster.$name.keyring
         mon allow pool delete = true
         osd journal size = 5120
         osd pool default min size = 2
         osd pool default size = 3
         public network = 10.10.10.0/24

[client]
         rbd cache = true
         rbd cache max dirty = 134217728
         rbd cache max dirty age = 2
         rbd cache size = 268435456
         rbd cache target dirty = 67108864
         rbd cache writethrough until flush = true

[osd]
         keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.pve-hs-3]
         host = pve-hs-3
         mon addr = 10.10.10.253:6789

[mon.pve-hs-main]
         host = pve-hs-main
         mon addr = 10.10.10.251:6789

[mon.pve-hs-2]
         host = pve-hs-2
         mon addr = 10.10.10.252:6789
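
If it's useful, the monitor addresses in the running monitor map can be cross-checked against the mon addr entries above with:

$ ceph mon dump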


Each node has two Ethernet cards in an LACP bond on the 10.10.10.x network:

auto bond0
iface bond0 inet static
        address  10.10.10.252
        netmask  255.255.255.0
        slaves enp4s0 enp4s1
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer3+4
#CLUSTER BOND
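
In case it helps, the LACP negotiation state on the node side can be verified with:

$ cat /proc/net/bonding/bond0

which shows the 802.3ad aggregator IDs and the link state of each slave.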


The LAG on the switch (TP-Link TL-SG2008) is enabled, as I can see from "show run":

#
interface gigabitEthernet 1/0/1
  channel-group 4 mode active
#
interface gigabitEthernet 1/0/2
  channel-group 4 mode active
#
interface gigabitEthernet 1/0/3
  channel-group 2 mode active
#
interface gigabitEthernet 1/0/4
  channel-group 2 mode active
#
interface gigabitEthernet 1/0/5
  channel-group 3 mode active
#
interface gigabitEthernet 1/0/6
  channel-group 3 mode active
#
interface gigabitEthernet 1/0/7
#
interface gigabitEthernet 1/0/8

Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, node 3 on ports 5 and 6.


Routing table, shown with "ip -4 route show table all":

default via 192.168.2.1 dev vmbr0 onlink
10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 10.10.10.252
local 10.10.10.252 dev bond0 table local proto kernel scope host src 10.10.10.252
broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 10.10.10.252
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 192.168.2.252
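
To double-check that traffic towards the other monitors actually leaves via the bond, the kernel routing decision can be queried directly:

$ ip route get 10.10.10.251
$ ip route get 10.10.10.253

Both should report dev bond0 with src 10.10.10.252.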


Network configuration:

$ ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
       valid_lft forever preferred_lft forever
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
       valid_lft forever preferred_lft forever

$ ip -4 link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
3: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
4: enp4s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
9: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether b2:47:55:9f:d3:0b brd ff:ff:ff:ff:ff:ff
11: veth103i0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:03:27:0d:02:38 brd ff:ff:ff:ff:ff:ff link-netnsid 0
13: veth106i0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:ce:4f:09:24:45 brd ff:ff:ff:ff:ff:ff link-netnsid 1
14: tap109i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 3a:f0:99:3f:6a:75 brd ff:ff:ff:ff:ff:ff
15: tap201i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 16:99:8a:56:6d:7f brd ff:ff:ff:ff:ff:ff
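
If packet loss on the bond members were a suspect, the per-NIC error and drop counters can be read with:

$ ip -s link show enp4s0
$ ip -s link show enp4s1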


I think that's everything.

Thanks



On 23/10/2017 15:42, Denes Dolhay wrote:

Hi,

Maybe some routing issue?


"CEPH has public and cluster network on 10.10.10.0/24"

Does this mean that the nodes have the public and cluster networks explicitly set, both to 10.10.10.0/24, or that you simply did not specify a separate cluster network?

Please provide your routing table, ifconfig output, and ceph.conf.


Regards,

Denes


On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:

Hello

I have a Ceph cluster with 3 nodes, each with 3 OSDs, running Proxmox. Ceph versions:

{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
    }
}

Ceph has its public and cluster networks both on 10.10.10.0/24; the three nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and networking is working well (I kept a ping from one of the nodes to the other two running for hours and had 0 packet loss).

On the node with IP 10.10.10.252 I get strange messages in dmesg:

kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established

and this goes on all day.
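
As far as I know, these messages come from the in-kernel Ceph client (the libceph module used by krbd mounts), not from the Ceph daemons themselves; the RBD images mapped through the kernel on this node can be listed with:

$ rbd showmapped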

In ceph -w I get:

2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0

pve-hs-main is the host with IP 10.10.10.251.

The Ceph storage is actually under very light use, averaging 200 kB/s read or write (as shown by ceph -s), so I don't think it's a problem of cluster load.

The strange thing is that I see "mon1 10.10.10.252:6789 session lost" in the log of node 10.10.10.252 itself, so the node is losing the connection to the monitor on the same host, which is why I don't think it's network related.

I already tried rebooting the nodes and restarting ceph-mon and ceph-mgr, but the problem is still there.

Any ideas?

Thanks





--
Marco Baldini
H.S. Amiata Srl
Ufficio:   0577-779396
Cellulare:   335-8765169
WEB:   www.hsamiata.it
EMAIL:   mbaldini@xxxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
