Hi Marco,

On Mon, Oct 23, 2017 at 04:10:34PM +0200, Marco Baldini - H.S. Amiata wrote:
> Thanks for the reply.
>
> My ceph.conf:
>
> [global]
>     auth client required = none
>     auth cluster required = none
>     auth service required = none
>     bluestore_block_db_size = 64424509440
>     cluster network = 10.10.10.0/24
>     fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
>     keyring = /etc/pve/priv/$cluster.$name.keyring
>     mon allow pool delete = true
>     osd journal size = 5120
>     osd pool default min size = 2
>     osd pool default size = 3
>     public network = 10.10.10.0/24
>
> [client]
>     rbd cache = true
>     rbd cache max dirty = 134217728
>     rbd cache max dirty age = 2
>     rbd cache size = 268435456
>     rbd cache target dirty = 67108864
>     rbd cache writethrough until flush = true
>
> [osd]
>     keyring = /var/lib/ceph/osd/ceph-$id/keyring
>
> [mon.pve-hs-3]
>     host = pve-hs-3
>     mon addr = 10.10.10.253:6789
>
> [mon.pve-hs-main]
>     host = pve-hs-main
>     mon addr = 10.10.10.251:6789
>
> [mon.pve-hs-2]
>     host = pve-hs-2
>     mon addr = 10.10.10.252:6789
>
> Each node has two ethernet cards in an LACP bond on the 10.10.10.x network:
>
> auto bond0
> iface bond0 inet static
>     address 10.10.10.252
>     netmask 255.255.255.0
>     slaves enp4s0 enp4s1
>     bond_miimon 100
>     bond_mode 802.3ad
>     bond_xmit_hash_policy layer3+4
> #CLUSTER BOND
>
> The LAG on the switch (TP-Link TL-SG2008) is enabled, as I can see from "show run":
>
> #
> interface gigabitEthernet 1/0/1
>     channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/2
>     channel-group 4 mode active
> #
> interface gigabitEthernet 1/0/3
>     channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/4
>     channel-group 2 mode active
> #
> interface gigabitEthernet 1/0/5
>     channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/6
>     channel-group 3 mode active
> #
> interface gigabitEthernet 1/0/7
> #
> interface gigabitEthernet 1/0/8
>
> Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, node 3 on ports 5 and 6.
>
> Routing table, shown with "ip -4 route show table all":
>
> default via 192.168.2.1 dev vmbr0 onlink
> 10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252
> 192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
> 192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
> broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 10.10.10.252
> local 10.10.10.252 dev bond0 table local proto kernel scope host src 10.10.10.252
> broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 10.10.10.252
> broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
> local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
> local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
> broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
> broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
> local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 192.168.1.252
> broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
> broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 192.168.2.252
> local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 192.168.2.252
> broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 192.168.2.252
>
> Network configuration:
>
> $ ip -4 a
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
> 6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
>     inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
>        valid_lft forever preferred_lft forever
> 7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
>        valid_lft forever preferred_lft forever
> 8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
>     inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
>        valid_lft forever preferred_lft forever
>
> $ ip -4 link
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
> 3: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 4: enp4s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
> 7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
> 8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
> 9: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether b2:47:55:9f:d3:0b brd ff:ff:ff:ff:ff:ff
> 11: veth103i0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether fe:03:27:0d:02:38 brd ff:ff:ff:ff:ff:ff link-netnsid 0
> 13: veth106i0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
>     link/ether fe:ce:4f:09:24:45 brd ff:ff:ff:ff:ff:ff link-netnsid 1
> 14: tap109i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 3a:f0:99:3f:6a:75 brd ff:ff:ff:ff:ff:ff
> 15: tap201i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 16:99:8a:56:6d:7f brd ff:ff:ff:ff:ff:ff
>
> I think that's everything.
>
> Thanks
>
>
> On 23/10/2017 15:42, Denes Dolhay wrote:
> >
> > Hi,
> >
> > Maybe some routing issue?
> >
> > "CEPH has public and cluster network on 10.10.10.0/24"
> >
> > Does this mean that the nodes have separate public and cluster networks, both
> > on 10.10.10.0/24, or that you did not specify a separate cluster network?
> >
> > Please provide the routing table, ifconfig and ceph.conf.
> >
> > Regards,
> >
> > Denes
> >
> >
> > On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:
> > > Hello,
> > >
> > > I have a CEPH cluster with 3 nodes, each with 3 OSDs, running
> > > Proxmox. CEPH versions:
> > >
> > > {
> > >     "mon": {
> > >         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
> > >     },
> > >     "mgr": {
> > >         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
> > >     },
> > >     "osd": {
> > >         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
> > >     },
> > >     "mds": {},
> > >     "overall": {
> > >         "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
> > >     }
> > > }
> > >
> > > CEPH has its public and cluster network on 10.10.10.0/24; the three
> > > nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and networking is
> > > working fine (I kept a ping from one of the nodes to the other two
> > > running for hours and had 0 packet loss).
> > >
> > > On one node, with IP 10.10.10.252, I get strange messages in dmesg:
> > >
> > > kern :info : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000391] libceph: mon1 10.10.10.252:6789 session established
> > > kern :info : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000749] libceph: mon2 10.10.10.253:6789 session established
> > > kern :info : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000312] libceph: mon1 10.10.10.252:6789 session established
> > > kern :info : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000730] libceph: mon0 10.10.10.251:6789 session established
> > > kern :info : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000330] libceph: mon1 10.10.10.252:6789 session established
> > > kern :info : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000951] libceph: mon0 10.10.10.251:6789 session established
> > > kern :info : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000733] libceph: mon2 10.10.10.253:6789 session established
> > > kern :info : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000328] libceph: mon1 10.10.10.252:6789 session established
> > > kern :info : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern :info : [ +0.001035] libceph: mon0 10.10.10.251:6789 session established
> > > kern :info : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern :info : [ +0.004221] libceph: mon1 10.10.10.252:6789 session established
> > > kern :info : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000927] libceph: mon0 10.10.10.251:6789 session established
> > > kern :info : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
> > > kern :info : [ +0.000524] libceph: mon1 10.10.10.252:6789 session established
> > >
> > > and that goes on all day.
> > > In ceph -w I get:
> > >
> > > 2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
> > > 2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > > 2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
> > > 2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
> > >
> > > pve-hs-main is the host with IP 10.10.10.251.
> > >
> > > At the moment the CEPH storage sees very little use, on average 200 kB/s read
> > > or write (as shown by ceph -s), so I don't think it's a problem with the
> > > load on the cluster.
> > >
> > > The strange thing is that I see "mon1 10.10.10.252:6789 session lost" in
> > > the log of node 10.10.10.252 itself, so it is losing the connection to the
> > > monitor on the same node; I don't think it's network related.
> > >
> > > I already tried rebooting the nodes and restarting ceph-mon and ceph-mgr,
> > > but the problem is still there.
> > >
> > > Any ideas?
> > >
> > > Thanks
> > >
> > >
> > > --
> > > Marco Baldini
> > > H.S. Amiata Srl
> > > Office: 0577-779396
> > > Mobile: 335-8765169
> > > WEB: www.hsamiata.it <https://www.hsamiata.it>
> > > EMAIL: mbaldini@xxxxxxxxxxx <mailto:mbaldini@xxxxxxxxxxx>
>
> --
> Marco Baldini
> H.S. Amiata Srl
> Office: 0577-779396
> Mobile: 335-8765169
> WEB: www.hsamiata.it <https://www.hsamiata.it>
> EMAIL: mbaldini@xxxxxxxxxxx <mailto:mbaldini@xxxxxxxxxxx>

Does the ceph-mon service restart when the session is lost? What do you see
in the ceph-mon.log on the failing mon node? A couple of quick checks are
sketched below my signature.

--
Cheers,
Alwin
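For example, something along these lines (an untested sketch; I'm assuming the
mon id matches the node hostname, e.g. pve-hs-2, and the default log path under
/var/log/ceph/, so adjust to your setup):

  # Did the mon daemon restart? Check the unit's start time and recent messages.
  systemctl status ceph-mon@pve-hs-2
  journalctl -u ceph-mon@pve-hs-2 --since "2 hours ago"

  # Follow the monitor log on the node that keeps losing the session.
  tail -f /var/log/ceph/ceph-mon.pve-hs-2.log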