Election deadlock after network split in stretch cluster

Hi,

We have a test cluster for evaluating stretch mode and we are running
into an issue where the monitors fail to elect a leader after a network
split, even though all monitors are online and can reach each other.

I saw a presentation by Gregory Farnum from FOSDEM 2020 about stretch
clusters, and from his explanation of the connectivity strategy it sounds
like this case should not happen, so I've put him on CC in case he can
share more details about this process.


Our cluster consists of 3 data centers, with a total of 5 monitor nodes
and 4 OSD nodes. Two data centers each have 2 monitors and 2 OSD nodes,
while the third data center provides a tiebreaker monitor.

We collected some debugging information with the following commands:

- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status`
for the "mon rank".

- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok connection scores dump`
for the "connectivity rank".

During the tests we used `watch -n1` to monitor the output of these
commands, which also means we may be missing state changes that happen
faster than the one-second interval.
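
Roughly, what we watched on each monitor node looks like the following;
combining both commands into a single `watch` invocation here is just for
illustration:

  watch -n1 '
    ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status
    ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok connection scores dump
  '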

The issue we see is triggered by the following sequence of events:

1. All nodes are online and the monitors have monitor and connectivity
ranks 0 to 4. Each rank occurs exactly once and the monitor rank of each
node matches its connectivity rank. I will refer to these nodes as A to E
from now on, so node A has rank 0, B has 1, C has 2, D has 3 and E has 4.

2. One data center with nodes B and C goes offline. We simulate this by
running `ip route add blackhole $ip` on all machines in that data center
to block the IPs of all nodes in the other two data centers. We do this
on the monitor and OSD nodes at roughly the same time (within about one
second), as sketched below.
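
A rough sketch of this step; `$REMOTE_IPS` is only a placeholder for the
monitor and OSD addresses of the other two data centers:

  # run on every monitor and OSD node of the data center we take offline;
  # packets routed to the listed addresses are silently dropped
  for ip in $REMOTE_IPS; do
      ip route add blackhole "$ip"
  done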

3. The surviving nodes A, D and E form a new quorum, so monitor ranks 0,
3 and 4 have survived. Their connectivity ranks change to 0, 1 and 2, but
their monitor ranks stay the same. They show up as "leader" and "peon" in
the monitor status. `ceph status` also shows that stretch mode has
detected a data center failure.

3.1. The monitors B and C in the offlined data center retain their monitor
and connectivity ranks of 1 and 2, but are stuck in "probing" (I think)
or "electing" state in the monitor status. I'm not 100% sure which state
they were in, but if that matters I can retest. They were certainly not
part of the quorum.

4. We bring the offline data center back by removing the blackhole routes
again (see the sketch below).
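
The corresponding teardown, again with `$REMOTE_IPS` as a placeholder:

  # run on the same nodes that received the blackhole routes in step 2
  for ip in $REMOTE_IPS; do
      ip route del blackhole "$ip"
  done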

5. The nodes momentarily manage to form a quorum, but it collapses within
seconds, and we are not quite sure whether that quorum contained all
nodes or just a subset of them.

6. The quorum collapses and the connectivity ranks of nodes B to E all
change to 1, while their monitor ranks are still 1 to 4. Both the monitor
rank and the connectivity rank of node A are still 0.

7. All monitors are stuck in "electing" state in the monitor status for
at least several minutes. We've seen this happen for hours before, but
back then we hadn't yet analysed it in this detail. After reproducing it
now, it stayed like this for at least 5 minutes.

During that time, the "epoch" in the monitor status increases by 2
roughly every 5 seconds of wall time. Nothing else happens.

We believe that the connectivity strategy is unable to create a quorum
because only ranks 0 and 1 appear to be online. It looks like Ceph does
not notice that a node's connectivity rank differs from its actual
monitor rank before the connectivity data is sent to the other nodes, so
those nodes simply collect all the data from nodes B to E under rank 1.
Ranks 2 to 4 are therefore seen as offline, but at least three monitors
are required to build a quorum since our cluster contains 5 monitor
nodes.

8. We restart one node where the connectivity rank and the monitor rank
mismatch. In this case we chose node D, with monitor rank 3 and
connectivity rank 1. After the restart its connectivity rank changed to 3
as well.

Now the cluster is again able to find 3 nodes to build a quorum, and
after a few seconds all 5 nodes join a working quorum, even though nodes
C and E still show connectivity rank 1, just like node B. Node B also has
monitor rank 1, so a connectivity rank of 1 is correct there.


It appears that the connectivity rank can only decrease and is never
increased again unless the monitor process is restarted. When failovers
happen and the quorum shrinks, the connectivity ranks shrink with it, and
eventually the cluster ends up in a state where the remaining
connectivity ranks are too low to satisfy the quorum requirement.

During brief testing, we were unable to reproduce this issue by taking
only the monitors offline, without also taking their OSD nodes offline.
It may therefore be related to the special stretch mode handling that
kicks in when an entire data center fails.

I believe this is not supposed to happen: 1) the cluster should recover
completely after the data center comes back online, and 2) the
connectivity rank of each node should match its monitor rank.

I hope I've described the problem in sufficient detail for you to
reproduce it. If not, please feel free to reach out.

Florian


PS: Resetting the connectivity scores with `ceph daemon mon.{name}
connection scores reset` on a single node does not change the
connectivity rank, and the problem persists. Restarting the monitor
process on that same node resolves the issue as described above.
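
A minimal sketch of that restart, assuming a package-based deployment with
systemd units and a mon id equal to the hostname (as in the asok paths
above); under cephadm the unit names differ:

  # restart only the ceph-mon daemon on the affected node
  systemctl restart ceph-mon@$(hostname)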

PPS: The output of `connection scores dump` is not a proper JSON document,
so parsing it with tools like jq drops a lot of information from the
reports array, which is actually a weird object instead of an array.
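
For illustration, the kind of pipeline we mean is roughly the following;
whether it is jq's handling of that mis-typed object (for example
duplicate keys being collapsed) that loses the data is only a guess on my
part:

  ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok connection scores dump | jq .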


Just in case, our monmap looks like this:

> ceph mon dump
epoch 42
fsid cd30ad06-b025-40f5-9b5e-767087e8a955
last_changed 2022-03-10T17:13:42.783584+0100
created 2022-02-01T18:16:40.188610+0100
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon cephtiebreaker-vrz0506c1n1-test-rz03
disallowed_leaders cephtiebreaker-vrz0506c1n1-test-rz03
0: [v2:10.156.107.252:3300/0,v1:10.156.107.252:6789/0] mon.ceph-monc1n1-test-vrz0506; crush_location {datacenter=rz05}
1: [v2:10.156.107.98:3300/0,v1:10.156.107.98:6789/0] mon.ceph-monc1n3-test-vrz0506; crush_location {datacenter=rz06}
2: [v2:10.156.107.30:3300/0,v1:10.156.107.30:6789/0] mon.ceph-monc1n4-test-vrz0506; crush_location {datacenter=rz06}
3: [v2:10.156.107.192:3300/0,v1:10.156.107.192:6789/0] mon.ceph-monc1n2-test-vrz0506; crush_location {datacenter=rz05}
4: [v2:10.156.255.1:3300/0,v1:10.156.255.1:6789/0] mon.cephtiebreaker-vrz0506c1n1-test-rz03; crush_location {datacenter=rz03}
dumped monmap epoch 42
