Hi experts,
we have set up a Proxmox cluster with a minimal environment for some testing.
We have put some VMs on the cluster and encountered mon quorum problems
while backups are running (the backups may be saturating either disk I/O or network I/O).
Setup:
4 Machines with Proxmox 5.2-2 (Ceph 12.2.5 luminous)
3 ceph mons
8 OSDs (2 per machine, each a 2TB disk, 25% used), with BlueStore
3 bonded NICs (balance-alb), all active: one (1GBit) for Proxmox machine access, one (10GBit) for the Ceph public network and one (10GBit) for the Ceph cluster network
Ceph config as follows:
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 192.168.17.0/24
fsid = 5070e036-8f6c-4795-a34d-9035472a628d
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 192.168.16.0/24
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.ariel2]
host = ariel2
mon addr = 192.168.16.32:6789
[mon.ariel1]
host = ariel1
mon addr = 192.168.16.31:6789
[mon.ariel4]
host = ariel4
mon addr = 192.168.16.34:6789
[osd.0]
public addr = 192.168.16.32
cluster addr = 192.168.17.32
[osd.1]
public addr = 192.168.16.34
cluster addr = 192.168.17.34
[osd.2]
public addr = 192.168.16.31
cluster addr = 192.168.17.31
[osd.3]
public addr = 192.168.16.31
cluster addr = 192.168.17.31
[osd.4]
public addr = 192.168.16.32
cluster addr = 192.168.17.32
[osd.5]
public addr = 192.168.16.34
cluster addr = 192.168.17.34
[osd.6]
public addr = 192.168.16.33
cluster addr = 192.168.17.33
[osd.7]
public addr = 192.168.16.33
cluster addr = 192.168.17.33
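As a quick sanity check of the address assignments above, a small script can confirm that every OSD's `public addr` and `cluster addr` lies inside the networks declared in `[global]`. This is only a sketch: the function name `check_osd_addrs` is made up for illustration, and it assumes the file can be read by Python's `configparser`, which holds for the fragment quoted here but is not guaranteed for every ceph.conf.

```python
import configparser
import ipaddress


def check_osd_addrs(conf_text):
    """Return (section, key, addr) tuples whose address is outside its declared network.

    Hypothetical helper: assumes conf_text is INI-like, as in the config quoted above.
    """
    # interpolation=None keeps values such as "$cluster.$name" untouched
    cp = configparser.ConfigParser(interpolation=None)
    cp.read_string(conf_text)
    nets = {
        "public addr": ipaddress.ip_network(cp["global"]["public network"]),
        "cluster addr": ipaddress.ip_network(cp["global"]["cluster network"]),
    }
    problems = []
    for section in cp.sections():
        if not section.startswith("osd."):
            continue  # only per-OSD sections carry these two keys here
        for key, net in nets.items():
            addr = cp[section].get(key)
            if addr is not None and ipaddress.ip_address(addr) not in net:
                problems.append((section, key, addr))
    return problems
```

An empty result means all listed OSD addresses match their networks; it says nothing about whether the bonds actually carry the traffic as intended.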
Everything is running smoothly until a backup is taken:
From machine 2:
2018-07-06 02:47:54.691483 mon.ariel4 mon.2 192.168.16.34:6789/0 30663 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:47:54.754901 mon.ariel2 mon.1 192.168.16.32:6789/0 29602 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:47:59.934534 mon.ariel2 mon.1 192.168.16.32:6789/0 29603 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
2018-07-06 02:48:00.056711 mon.ariel2 mon.1 192.168.16.32:6789/0 29608 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
2018-07-06 02:48:00.133880 mon.ariel2 mon.1 192.168.16.32:6789/0 29610 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
2018-07-06 02:48:22.038196 mon.ariel4 mon.2 192.168.16.34:6789/0 30668 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:48:22.078876 mon.ariel2 mon.1 192.168.16.32:6789/0 29615 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:48:27.197263 mon.ariel2 mon.1 192.168.16.32:6789/0 29616 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
2018-07-06 02:48:27.237330 mon.ariel2 mon.1 192.168.16.32:6789/0 29621 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
2018-07-06 02:48:27.357095 mon.ariel2 mon.1 192.168.16.32:6789/0 29622 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
2018-07-06 02:49:11.927388 mon.ariel4 mon.2 192.168.16.34:6789/0 30675 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:49:12.001371 mon.ariel2 mon.1 192.168.16.32:6789/0 29629 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:49:17.163727 mon.ariel2 mon.1 192.168.16.32:6789/0 29630 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
2018-07-06 02:49:17.199214 mon.ariel2 mon.1 192.168.16.32:6789/0 29635 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
2018-07-06 02:49:17.296646 mon.ariel2 mon.1 192.168.16.32:6789/0 29636 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
From machine 1:
2018-07-06 02:47:08.541710 mon.ariel2 mon.1 192.168.16.32:6789/0 29590 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:47:12.949379 mon.ariel1 mon.0 192.168.16.31:6789/0 33844 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:47:13.929753 mon.ariel1 mon.0 192.168.16.31:6789/0 33845 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:47:16.793479 mon.ariel1 mon.0 192.168.16.31:6789/0 33850 : cluster [INF] overall HEALTH_OK
2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:50:03.642986 mon.ariel1 mon.0 192.168.16.31:6789/0 33895 : cluster [INF] overall HEALTH_OK
2018-07-06 02:50:46.757619 mon.ariel1 mon.0 192.168.16.31:6789/0 33899 : cluster [INF] mon.ariel1 calling monitor election
2018-07-06 02:50:46.920468 mon.ariel1 mon.0 192.168.16.31:6789/0 33900 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
2018-07-06 02:50:47.104222 mon.ariel1 mon.0 192.168.16.31:6789/0 33905 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
2018-07-06 02:50:47.104240 mon.ariel1 mon.0 192.168.16.31:6789/0 33906 : cluster [INF] Cluster is now healthy
2018-07-06 02:50:47.256301 mon.ariel1 mon.0 192.168.16.31:6789/0 33907 : cluster [INF] overall HEALTH_OK
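To see at a glance which monitor keeps dropping out, the log excerpts above can be summarized with a short script. This is only a sketch: the names `summarize`, `ALL_MONS`, `QUORUM_RE` and `ELECTION_RE` are made up for illustration, and the patterns are derived solely from the log lines quoted in this mail.

```python
import re
from collections import Counter

# Monitor names taken from the config above; adjust for another cluster.
ALL_MONS = {"ariel1", "ariel2", "ariel4"}

# Patterns modelled on the quoted cluster log lines (an assumption, not a spec).
QUORUM_RE = re.compile(r"(mon\.\S+) is new leader, mons (\S+) in quorum")
ELECTION_RE = re.compile(r"\] (mon\.\S+) calling monitor election")


def summarize(lines):
    """Count election calls per mon and how often each mon is absent from a new quorum."""
    calls = Counter()
    missing = Counter()
    for line in lines:
        m = ELECTION_RE.search(line)
        if m:
            calls[m.group(1)] += 1
            continue
        m = QUORUM_RE.search(line)
        if m:
            in_quorum = set(m.group(2).split(","))
            for mon in ALL_MONS - in_quorum:
                missing[mon] += 1
    return calls, missing


# Usage sketch (the log path is an assumption about the mon host):
#   with open("/var/log/ceph/ceph.log") as f:
#       calls, missing = summarize(f)
#   print(calls, missing)
```

In the excerpts quoted here every reduced quorum is "ariel2,ariel4", so the only monitor such a summary would report as missing is ariel1.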
There seems to be some disturbance of the mon traffic.
Since the mons communicate over a 10GBit interface, I would not expect a problem there.
No errors are logged on either the network interfaces or the switches.
Maybe the disks are too slow (the OSDs are on SATA), so we are thinking about putting the BlueStore journal (WAL/DB) on an SSD.
But would that help to stabilize the mons?
Or would a setup with 5 machines (5 mons running) be the better choice?
We are a little stuck on where to search for a solution.
What debug output would help us see whether we have a disk or a network problem here?
Thanks for your input!
Marcus Haarmann