Re: Ceph mon quorum problems under load

On Fri, Jul 6, 2018 at 11:10 AM Marcus Haarmann
<marcus.haarmann@xxxxxxxxx> wrote:
>
> Hi experts,
>
> we have set up a Proxmox cluster with a minimal environment for some testing.
> We have put some VMs on the cluster and encountered mon quorum problems
> while backups are executed (possibly saturating either hard disk I/O or network I/O).
> Setup:
> 4 Machines with Proxmox 5.2-2 (Ceph 12.2.5 luminous)
> 3 ceph mons
> 8 osd (2 per machine, each 2TB disk space, usage 25%), with bluestore
> 3 bonded NICs (balance-alb) active: one 1GBit for Proxmox machine access, one 10GBit for the Ceph public network and one 10GBit for the Ceph cluster network
>
> Ceph config as follows
>
> [global]
> auth client required = cephx
> auth cluster required = cephx
> auth service required = cephx
> cluster network = 192.168.17.0/24
> fsid = 5070e036-8f6c-4795-a34d-9035472a628d
> keyring = /etc/pve/priv/$cluster.$name.keyring
> mon allow pool delete = true
> osd journal size = 5120
> osd pool default min size = 2
> osd pool default size = 3
> public network = 192.168.16.0/24
>
> [osd]
> keyring = /var/lib/ceph/osd/ceph-$id/keyring
>
> [mon.ariel2]
> host = ariel2
> mon addr = 192.168.16.32:6789
>
> [mon.ariel1]
> host = ariel1
> mon addr = 192.168.16.31:6789
>
> [mon.ariel4]
> host = ariel4
> mon addr = 192.168.16.34:6789
>
> [osd.0]
> public addr = 192.168.16.32
> cluster addr = 192.168.17.32
>
> [osd.1]
> public addr = 192.168.16.34
> cluster addr = 192.168.17.34
>
> [osd.2]
> public addr = 192.168.16.31
> cluster addr = 192.168.17.31
>
> [osd.3]
> public addr = 192.168.16.31
> cluster addr = 192.168.17.31
>
> [osd.4]
> public addr = 192.168.16.32
> cluster addr = 192.168.17.32
>
> [osd.5]
> public addr = 192.168.16.34
> cluster addr = 192.168.17.34
>
> [osd.6]
> public addr = 192.168.16.33
> cluster addr = 192.168.17.33
>
> [osd.7]
> public addr = 192.168.16.33
> cluster addr = 192.168.17.33
>
> Everything is running smoothly until a backup is taken:
> (from machine 2)
> 2018-07-06 02:47:54.691483 mon.ariel4 mon.2 192.168.16.34:6789/0 30663 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:47:54.754901 mon.ariel2 mon.1 192.168.16.32:6789/0 29602 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:47:59.934534 mon.ariel2 mon.1 192.168.16.32:6789/0 29603 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> 2018-07-06 02:48:00.056711 mon.ariel2 mon.1 192.168.16.32:6789/0 29608 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> 2018-07-06 02:48:00.133880 mon.ariel2 mon.1 192.168.16.32:6789/0 29610 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> 2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:48:22.038196 mon.ariel4 mon.2 192.168.16.34:6789/0 30668 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:48:22.078876 mon.ariel2 mon.1 192.168.16.32:6789/0 29615 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:48:27.197263 mon.ariel2 mon.1 192.168.16.32:6789/0 29616 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> 2018-07-06 02:48:27.237330 mon.ariel2 mon.1 192.168.16.32:6789/0 29621 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> 2018-07-06 02:48:27.357095 mon.ariel2 mon.1 192.168.16.32:6789/0 29622 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> 2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:49:11.927388 mon.ariel4 mon.2 192.168.16.34:6789/0 30675 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:49:12.001371 mon.ariel2 mon.1 192.168.16.32:6789/0 29629 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:49:17.163727 mon.ariel2 mon.1 192.168.16.32:6789/0 29630 : cluster [INF] mon.ariel2 is new leader, mons ariel2,ariel4 in quorum (ranks 1,2)
> 2018-07-06 02:49:17.199214 mon.ariel2 mon.1 192.168.16.32:6789/0 29635 : cluster [WRN] Health check failed: 1/3 mons down, quorum ariel2,ariel4 (MON_DOWN)
> 2018-07-06 02:49:17.296646 mon.ariel2 mon.1 192.168.16.32:6789/0 29636 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ariel2,ariel4
> 2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
>
> From machine 1:
> 2018-07-06 02:47:08.541710 mon.ariel2 mon.1 192.168.16.32:6789/0 29590 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:47:12.949379 mon.ariel1 mon.0 192.168.16.31:6789/0 33844 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:47:13.929753 mon.ariel1 mon.0 192.168.16.31:6789/0 33845 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:47:16.793479 mon.ariel1 mon.0 192.168.16.31:6789/0 33850 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:48:09.480385 mon.ariel1 mon.0 192.168.16.31:6789/0 33856 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:09.635420 mon.ariel4 mon.2 192.168.16.34:6789/0 30666 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:48:09.635729 mon.ariel2 mon.1 192.168.16.32:6789/0 29613 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:48:09.723634 mon.ariel1 mon.0 192.168.16.31:6789/0 33857 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:10.059104 mon.ariel1 mon.0 192.168.16.31:6789/0 33858 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:48:10.587894 mon.ariel1 mon.0 192.168.16.31:6789/0 33863 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:48:10.587910 mon.ariel1 mon.0 192.168.16.31:6789/0 33864 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:48:32.456742 mon.ariel1 mon.0 192.168.16.31:6789/0 33867 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:48:33.011025 mon.ariel1 mon.0 192.168.16.31:6789/0 33868 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:48:33.967501 mon.ariel1 mon.0 192.168.16.31:6789/0 33873 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:48:33.967523 mon.ariel1 mon.0 192.168.16.31:6789/0 33874 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:48:35.002941 mon.ariel1 mon.0 192.168.16.31:6789/0 33875 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:49:47.014202 mon.ariel1 mon.0 192.168.16.31:6789/0 33880 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:49:47.357144 mon.ariel1 mon.0 192.168.16.31:6789/0 33881 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:49:47.639535 mon.ariel1 mon.0 192.168.16.31:6789/0 33886 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:49:47.639553 mon.ariel1 mon.0 192.168.16.31:6789/0 33887 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:49:47.810993 mon.ariel1 mon.0 192.168.16.31:6789/0 33888 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:49:59.349085 mon.ariel4 mon.2 192.168.16.34:6789/0 30681 : cluster [INF] mon.ariel4 calling monitor election
> 2018-07-06 02:49:59.427457 mon.ariel2 mon.1 192.168.16.32:6789/0 29648 : cluster [INF] mon.ariel2 calling monitor election
> 2018-07-06 02:50:02.978856 mon.ariel1 mon.0 192.168.16.31:6789/0 33889 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:50:03.299621 mon.ariel1 mon.0 192.168.16.31:6789/0 33890 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:50:03.642986 mon.ariel1 mon.0 192.168.16.31:6789/0 33895 : cluster [INF] overall HEALTH_OK
> 2018-07-06 02:50:46.757619 mon.ariel1 mon.0 192.168.16.31:6789/0 33899 : cluster [INF] mon.ariel1 calling monitor election
> 2018-07-06 02:50:46.920468 mon.ariel1 mon.0 192.168.16.31:6789/0 33900 : cluster [INF] mon.ariel1 is new leader, mons ariel1,ariel2,ariel4 in quorum (ranks 0,1,2)
> 2018-07-06 02:50:47.104222 mon.ariel1 mon.0 192.168.16.31:6789/0 33905 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ariel2,ariel4)
> 2018-07-06 02:50:47.104240 mon.ariel1 mon.0 192.168.16.31:6789/0 33906 : cluster [INF] Cluster is now healthy
> 2018-07-06 02:50:47.256301 mon.ariel1 mon.0 192.168.16.31:6789/0 33907 : cluster [INF] overall HEALTH_OK
>
>
> There seems to be some disturbance of mon traffic.
> Since the mons are communicating via a 10GBit interface, I would not assume a problem here.
> There are no errors logged either on the network interfaces or on the switches.
>
> Maybe the disks are too slow (the OSDs are on SATA disks), so we are thinking about putting the BlueStore journal (WAL/DB) on an SSD.
> But would that action help to stabilize the mons ?

Probably not -- if anything, giving the OSDs faster storage will
alleviate that bottleneck and enable the OSD daemons to hit the CPU
even harder.
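
If you want to confirm whether that is already happening during a backup
window, per-process CPU figures on each node would show it (a sketch;
pidstat comes from the sysstat package, which I'm assuming is installed):

  pidstat -u -C ceph-osd 5    # CPU usage of the OSD daemons, 5-second samples
  pidstat -u -C ceph-mon 5    # the same for the mon, for comparison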

> Or would a setup with 5 machines (5 mons running) be the better choice ?

Adding more mons won't help if they're being disrupted by running on
overloaded nodes.

> So we are a little stuck where to search for a solution.
> What debug output would help to see whether we have a disk or network problem here ?

You didn't mention what storage devices the mons are using (i.e. what
device is hosting /var) -- hopefully it's an SSD that isn't loaded
with any other heavy workloads?
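
A quick way to check (assuming the default mon data path under
/var/lib/ceph/mon) is:

  df -h /var/lib/ceph/mon          # which filesystem/device holds the mon store
  lsblk -d -o NAME,ROTA,MODEL      # ROTA=1 means a spinning disk, 0 means an SSD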

For checking how saturated your CPU and network are, you'd use any of
the standard Linux tools (I'm a dstat fan, personally).
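
For example, something like the following on each node while a backup is
running (assuming the dstat and sysstat packages are installed):

  dstat -cdngy 5                # cpu, disk, net, paging and interrupts, 5s samples
  dstat --top-cpu --top-io 5    # busiest process by CPU and by I/O
  iostat -x 5                   # per-device utilization and latency (await, %util)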

When running mons and OSDs on the same nodes, it is preferable to run
them in containers so that resource utilization can be limited, thereby
protecting the monitors from over-enthusiastic OSD daemons.  Otherwise
you will keep seeing similar issues whenever any system resource is
saturated.  The alternative is to simply over-provision hardware to the
point that contention is not an issue (which is a bit expensive of
course).
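
If full containers are more than you want to take on, a rough sketch of
the same idea using plain systemd resource controls (assuming the
ceph-osd@ units that the Proxmox/Debian packages install; the limits
below are purely illustrative, not recommendations):

  # /etc/systemd/system/ceph-osd@.service.d/limits.conf
  [Service]
  CPUQuota=200%
  MemoryLimit=4G

Then reload systemd and restart the OSDs one at a time:

  systemctl daemon-reload
  systemctl restart ceph-osd@0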

John

>
> Thanks for your input!
>
> Marcus Haarmann
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


