Thanks for the reply.
My ceph.conf:
[global]
auth client required = none
auth cluster required = none
auth service required = none
bluestore_block_db_size = 64424509440
cluster network = 10.10.10.0/24
fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 10.10.10.0/24
[client]
rbd cache = true
rbd cache max dirty = 134217728
rbd cache max dirty age = 2
rbd cache size = 268435456
rbd cache target dirty = 67108864
rbd cache writethrough until flush = true
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.pve-hs-3]
host = pve-hs-3
mon addr = 10.10.10.253:6789
[mon.pve-hs-main]
host = pve-hs-main
mon addr = 10.10.10.251:6789
[mon.pve-hs-2]
host = pve-hs-2
mon addr = 10.10.10.252:6789
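As a sanity check, the monitor addresses actually registered in the monmap can be compared with the conf above; a quick check from any node:

# the monmap should list the same 10.10.10.x:6789 addresses as ceph.conf
$ ceph mon dump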
Each node has two Ethernet cards in an LACP bond on the 10.10.10.x network:
auto bond0
iface bond0 inet static
address 10.10.10.252
netmask 255.255.255.0
slaves enp4s0 enp4s1
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer3+4
#CLUSTER BOND
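If it helps, the LACP negotiation can be verified on each node; a minimal check, assuming the standard Linux bonding driver:

# both slaves should show "MII Status: up" and the same aggregator ID,
# and the transmit hash policy should match the config (layer3+4)
$ cat /proc/net/bonding/bond0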
The LAG on the switch (TP-Link TL-SG2008) is enabled; I can see it in "show run":
#
interface gigabitEthernet 1/0/1
channel-group 4 mode active
#
interface gigabitEthernet 1/0/2
channel-group 4 mode active
#
interface gigabitEthernet 1/0/3
channel-group 2 mode active
#
interface gigabitEthernet 1/0/4
channel-group 2 mode active
#
interface gigabitEthernet 1/0/5
channel-group 3 mode active
#
interface gigabitEthernet 1/0/6
channel-group 3 mode active
#
interface gigabitEthernet 1/0/7
#
interface gigabitEthernet 1/0/8
Node 1 is on ports 1 and 2, node 2 on ports 3 and 4, node 3 on ports 5 and 6.
Routing table, shown with "ip -4 route show table all":
default via 192.168.2.1 dev vmbr0 onlink
10.10.10.0/24 dev bond0 proto kernel scope link src 10.10.10.252
192.168.1.0/24 dev vmbr1 proto kernel scope link src 192.168.1.252 linkdown
192.168.2.0/24 dev vmbr0 proto kernel scope link src 192.168.2.252
broadcast 10.10.10.0 dev bond0 table local proto kernel scope link src 10.10.10.252
local 10.10.10.252 dev bond0 table local proto kernel scope host src 10.10.10.252
broadcast 10.10.10.255 dev bond0 table local proto kernel scope link src 10.10.10.252
broadcast 127.0.0.0 dev lo table local proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo table local proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo table local proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo table local proto kernel scope link src 127.0.0.1
broadcast 192.168.1.0 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
local 192.168.1.252 dev vmbr1 table local proto kernel scope host src 192.168.1.252
broadcast 192.168.1.255 dev vmbr1 table local proto kernel scope link src 192.168.1.252 linkdown
broadcast 192.168.2.0 dev vmbr0 table local proto kernel scope link src 192.168.2.252
local 192.168.2.252 dev vmbr0 table local proto kernel scope host src 192.168.2.252
broadcast 192.168.2.255 dev vmbr0 table local proto kernel scope link src 192.168.2.252
Network configuration:
$ ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 192.168.1.252/24 brd 192.168.1.255 scope global vmbr1
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.10.10.252/24 brd 10.10.10.255 scope global bond0
       valid_lft forever preferred_lft forever
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 192.168.2.252/24 brd 192.168.2.255 scope global vmbr0
       valid_lft forever preferred_lft forever
$ ip -4 link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
3: enp4s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
4: enp4s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
6: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 98:de:d0:1d:75:4a brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 40:8d:5c:b0:2d:fe brd ff:ff:ff:ff:ff:ff
9: tap104i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether b2:47:55:9f:d3:0b brd ff:ff:ff:ff:ff:ff
11: veth103i0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:03:27:0d:02:38 brd ff:ff:ff:ff:ff:ff link-netnsid 0
13: veth106i0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether fe:ce:4f:09:24:45 brd ff:ff:ff:ff:ff:ff link-netnsid 1
14: tap109i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 3a:f0:99:3f:6a:75 brd ff:ff:ff:ff:ff:ff
15: tap201i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 16:99:8a:56:6d:7f brd ff:ff:ff:ff:ff:ff
I think that's everything.
Thanks
On 23/10/2017 15:42, Denes Dolhay wrote:
Hi,
Maybe a routing issue?
"CEPH has public and cluster network on 10.10.10.0/24"
Does this mean that the nodes have the public and the cluster network configured separately, both on 10.10.10.0/24, or that you did not specify a separate cluster network?
Please provide the routing table, the ifconfig output, and ceph.conf.
Regards,
Denes
On 10/23/2017 03:35 PM, Marco Baldini - H.S. Amiata wrote:
Hello
I have a CEPH cluster with 3 nodes, each with 3 OSDs, running Proxmox. CEPH versions:
{
    "mon": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 9
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.1 (1a629971a9bcaaae99e5539a3a43f800a297f267) luminous (stable)": 15
    }
}
CEPH has the public and cluster networks both on 10.10.10.0/24; the three nodes are 10.10.10.251, 10.10.10.252 and 10.10.10.253, and networking is working well (I kept a ping running from one of the nodes to the other two for hours and had 0 packet loss).
On the node with IP 10.10.10.252 I get strange messages in dmesg:
kern :info : [Oct23 14:42] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000391] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721869] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000749] libceph: mon2 10.10.10.253:6789 session established
kern :info : [Oct23 14:43] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000312] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721964] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000730] libceph: mon0 10.10.10.251:6789 session established
kern :info : [Oct23 14:44] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000330] libceph: mon1 10.10.10.252:6789 session established
kern :info : [ +30.721899] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000951] libceph: mon0 10.10.10.251:6789 session established
kern :info : [Oct23 14:45] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000733] libceph: mon2 10.10.10.253:6789 session established
kern :info : [ +30.721529] libceph: mon2 10.10.10.253:6789 session lost, hunting for new mon
kern :info : [ +0.000328] libceph: mon1 10.10.10.252:6789 session established
kern :info : [Oct23 14:46] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.001035] libceph: mon0 10.10.10.251:6789 session established
kern :info : [ +30.721183] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.004221] libceph: mon1 10.10.10.252:6789 session established
kern :info : [Oct23 14:47] libceph: mon1 10.10.10.252:6789 session lost, hunting for new mon
kern :info : [ +0.000927] libceph: mon0 10.10.10.251:6789 session established
kern :info : [ +30.721361] libceph: mon0 10.10.10.251:6789 session lost, hunting for new mon
kern :info : [ +0.000524] libceph: mon1 10.10.10.252:6789 session established
and this goes on all day.
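If I understand correctly, those libceph messages come from the kernel client (krbd mappings or a kernel CephFS mount) rather than from the Ceph daemons themselves, so the relevant clients can be listed on the node; a quick check, assuming krbd and CephFS are the only kernel consumers:

$ rbd showmapped          # kernel RBD images mapped on this node
$ grep ceph /proc/mounts  # any kernel CephFS mounts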
In ceph -w I get:
2017-10-23 14:51:57.941131 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:57.941433 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 14:56:58.124457 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:00:00.000184 mon.pve-hs-main [INF] overall HEALTH_OK
2017-10-23 15:01:57.941312 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:01:57.941558 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:06:57.941420 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:06:57.941544 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
2017-10-23 15:11:57.941573 mon.pve-hs-main [INF] mon.1 10.10.10.252:6789/0
2017-10-23 15:11:57.941659 mon.pve-hs-main [INF] mon.2 10.10.10.253:6789/0
pve-hs-main is the host with IP 10.10.10.251.
Actually the CEPH storage is under very low usage, on average 200 kB/s read or write (as shown by ceph -s), so I don't think it's a problem with the load on the cluster.
The strange thing is that I see "mon1 10.10.10.252:6789 session lost" in the log of node 10.10.10.252 itself, so it is losing the connection to the monitor on its own node; I don't think it's network related.
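To dig further, the monitor's view of its client sessions can be inspected through the admin socket; a minimal sketch, assuming it is run on the node hosting mon.pve-hs-2:

# list the sessions currently open against this monitor (run on the mon's own node)
$ ceph daemon mon.pve-hs-2 sessions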
I have already tried rebooting the nodes and restarting ceph-mon and ceph-mgr, but the problem is still there.
Any ideas?
Thanks
--
*Marco Baldini*
*H.S. Amiata Srl*
Ufficio: 0577-779396
Cellulare: 335-8765169
WEB: www.hsamiata.it <https://www.hsamiata.it>
EMAIL: mbaldini@xxxxxxxxxxx <mailto:mbaldini@xxxxxxxxxxx>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com