Re: Continuous error: "libceph: monX session lost, hunting for new mon" on one host


 



Hi Marco,

On Mon, Oct 23, 2017 at 04:10:34PM +0200, Marco Baldini - H.S. Amiata wrote:
> Thanks for the reply
>
> My ceph.conf:
>
>    [global]
>              auth client required = none
>              auth cluster required = none
>              auth service required = none
>              bluestore_block_db_size = 64424509440
>              cluster network = 10.10.10.0/24
>              fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
>              keyring = /etc/pve/priv/$cluster.$name.keyring
>              mon allow pool delete = true
>              osd journal size = 5120
>              osd pool default min size = 2
>              osd pool default size = 3
>              public network = 10.10.10.0/24
>
>    [client]
>              rbd cache = true
>              rbd cache max dirty = 134217728
>              rbd cache max dirty age = 2
>              rbd cache size = 268435456
>              rbd cache target dirty = 67108864
>              rbd cache writethrough until flush = true
>
>    [osd]
>              keyring = /var/lib/ceph/osd/ceph-$id/keyring
>
>    [mon.pve-hs-3]
>              host = pve-hs-3
>              mon addr = 10.10.10.253:6789
>
>    [mon.pve-hs-main]
>              host = pve-hs-main
>              mon addr = 10.10.10.251:6789
>
>    [mon.pve-hs-2]
>              host = pve-hs-2
>              mon addr = 10.10.10.252:6789
>
>
> Each node has two Ethernet cards in an LACP bond on the 10.10.10.x network:
>
> auto bond0
> iface bond0 inet static
>         address  10.10.10.252
>         netmask  255.255.255.0
>         slaves enp4s0 enp4s1
>         bond_miimon 100
>         bond_mode 802.3ad
>         bond_xmit_hash_policy layer3+4
> #CLUSTER BOND
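
A quick way to rule out LACP negotiation problems on the Linux side would be
something like the following (a minimal sketch, assuming the in-kernel bonding
driver and the bond0/enp4s0/enp4s1 names from the config above):

    # Negotiated 802.3ad state, aggregator IDs and per-slave link status
    cat /proc/net/bonding/bond0

    # Error/drop counters on the two bond members
    ip -s link show enp4s0
    ip -s link show enp4s1

If the two slaves report different aggregator IDs, the LAG is not actually
forming and traffic may only be using one link.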
>
>
> The LAG on the switch (TP-Link TL-SG2008) is enabled, as I can see from "show run":
>
> #
> interface gigabitEthernet 1/0/1
>
>   channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/2
>
>   channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/3
>
>   channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/4
>
>   channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/5
>
>   channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/6
>
>   channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/7
>
> #
> interface gigabitEthernet 1/0/8
>
>
> Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, and node 3 on ports 5 and 6.
>
>
> Routing table, shown with "ip -4 route show table all":
>
> default via 192.168.2.1 dev vmbr0 onlink
> 10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252
> 192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
> 192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
> broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 10.10.10.252
> local 10.10.10.252 dev bond0 table local proto kernel scope host src 10.10.10.252
> broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 10.10.10.252
> broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
> local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
> local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
> broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
> broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
> local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 192.168.1.252
> broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
> broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 192.168.2.252
> local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 192.168.2.252
> broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 192.168.2.252
>
>
> Network configuration
>
> $ ip -4 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
> 6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
>     inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
>        valid_lft forever preferred_lft forever
> 7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
>        valid_lft forever preferred_lft forever
> 8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
>        valid_lft forever preferred_lft forever
>
> $ ip -4 link
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
> 3: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 4: enp4s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
> 9: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether b2:47:55:9f:d3:0b brd ff:ff:ff:ff:ff:ff
> 11: veth103i0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether fe:03:27:0d:02:38 brd ff:ff:ff:ff:ff:ff link-netnsid 0
> 13: veth106i0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether fe:ce:4f:09:24:45 brd ff:ff:ff:ff:ff:ff link-netnsid 1
> 14: tap109i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 3a:f0:99:3f:6a:75 brd ff:ff:ff:ff:ff:ff
> 15: tap201i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 16:99:8a:56:6d:7f brd ff:ff:ff:ff:ff:ff
>
>
> I think that's everything.
>
> Thanks
>
>
>
> On 23/10/2017 15:42, Denes Dolhay wrote:
> >
> > Hi,
> >
> > Maybe some routing issue?
> >
> >
> > "CEPH has public and cluster network on 10.10.10.0/24"
> >
> > Does this mean that the nodes have the public and cluster networks
> > defined separately but both on 10.10.10.0/24, or that you did not
> > specify a separate cluster network?
> >
> > Please provide route table, ifconfig, ceph.conf
> >
> >
> > Regards,
> >
> > Denes
> >
> >
> > On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:
> > >
> > > Hello
> > >
> > > I have a CEPH cluster with 3 nodes, each with 3 OSDs, running
> > > Proxmox. CEPH versions:
> > >
> > > {
> > >      "mon": {
> > >          "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
> > >      },
> > >      "mgr": {
> > >          "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
> > >      },
> > >      "osd": {
> > >          "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
> > >      },
> > >      "mds": {},
> > >      "overall": {
> > >          "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
> > >      }
> > > }
> > >
> > > CEPH has public and cluster network on 10.10.10.0/24, the three
> > > nodes are 10.10.10.251, 10.10.10.252, and 10.10.10.253, and networking is
> > > working well (I kept a ping from one of the nodes to the other two
> > > running for hours with 0 packet loss)
> > >
> > > On the node with IP 10.10.10.252 I get strange messages in dmesg:
> > >
> > > kern  :info  : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000391] libceph: mon1 10.10.10.252:6789 session established
> > > kern  :info  : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000749] libceph: mon2 10.10.10.253:6789 session established
> > > kern  :info  : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000312] libceph: mon1 10.10.10.252:6789 session established
> > > kern  :info  : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000730] libceph: mon0 10.10.10.251:6789 session established
> > > kern  :info  : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000330] libceph: mon1 10.10.10.252:6789 session established
> > > kern  :info  : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000951] libceph: mon0 10.10.10.251:6789 session established
> > > kern  :info  : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000733] libceph: mon2 10.10.10.253:6789 session established
> > > kern  :info  : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000328] libceph: mon1 10.10.10.252:6789 session established
> > > kern  :info  : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.001035] libceph: mon0 10.10.10.251:6789 session established
> > > kern  :info  : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.004221] libceph: mon1 10.10.10.252:6789 session established
> > > kern  :info  : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000927] libceph: mon0 10.10.10.251:6789 session established
> > > kern  :info  : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern  :info  : [  +0.000524] libceph: mon1 10.10.10.252:6789 session established
> > >
> > > and this goes on all day.
> > >
> > > In ceph -w I get
> > >
> > > 2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
> > > 2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > >
> > > pve-hs-main is the host with ip 10.10.10.251
> > >
> > > The CEPH storage usage is actually very low, on average 200 kB/s read
> > > or write (as shown by ceph -s), so I don't think it's a problem with
> > > the load on the cluster.
> > >
> > > The strange thing is that I see "mon1 10.10.10.252:6789 session lost"
> > > in the log of node 10.10.10.252 itself, so it's losing the connection
> > > to the monitor on the same node; I don't think it's network related.
> > >
> > > I have already tried rebooting the nodes and restarting ceph-mon and
> > > ceph-mgr, but the problem is still there.
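
For reference, on a Proxmox/systemd setup those restarts would typically be
done per node along these lines (a sketch only; the mon and mgr instance
names are assumed to match the hostnames used in the ceph.conf above):

    # restart the local monitor and manager daemons, e.g. on pve-hs-2
    systemctl restart ceph-mon@pve-hs-2.service
    systemctl restart ceph-mgr@pve-hs-2.service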
> > >
> > > Any ideas?
> > >
> > > Thanks
> > >
> > >
> > >
> > >
> > > --
> > > *Marco Baldini*
> > > *H.S. Amiata Srl*
> > > Ufficio: 	0577-779396
> > > Cellulare: 	335-8765169
> > > WEB: 	www.hsamiata.it <https://www.hsamiata.it>
> > > EMAIL: 	mbaldini@xxxxxxxxxxx <mailto:mbaldini@xxxxxxxxxxx>
> > >
> > >
> > >
>
> --
> *Marco Baldini*
> *H.S. Amiata Srl*
> Ufficio: 	0577-779396
> Cellulare: 	335-8765169
> WEB: 	www.hsamiata.it <https://www.hsamiata.it>
> EMAIL: 	mbaldini@xxxxxxxxxxx <mailto:mbaldini@xxxxxxxxxxx>
>

Does the ceph-mon service restart when the session is lost?
What do you see in the ceph-mon.log on the failing mon node?
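
Something along these lines should answer both (a sketch, assuming a
systemd-based install and the default log location; replace pve-hs-2 with
the mon id of the failing node):

    # Has the local mon daemon been restarting?
    systemctl status ceph-mon@pve-hs-2.service
    journalctl -u ceph-mon@pve-hs-2.service --since "1 hour ago"

    # Watch the monitor log while the libceph messages show up in dmesg
    tail -f /var/log/ceph/ceph-mon.pve-hs-2.log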

--
Cheers,
Alwin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



