Hi,

We have a test cluster for evaluating stretch mode and we are running into an issue where the monitors fail to elect a leader after a network split, even though all monitors are online and can reach each other. I saw a presentation by Gregory Farnum from FOSDEM 2020 about stretch clusters, and his explanation of the connectivity strategy sounds like this case should not happen, so I've put him on CC in case he can share more details about this process.

Our cluster consists of 3 data centers, with a total of 5 monitor nodes and 4 OSD nodes. Two data centers each have 2 monitors and 2 OSD nodes, while the third data center provides a tiebreaker monitor.

We collected some debugging information with the following commands:

- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status` for the "mon rank".
- `ceph daemon /var/run/ceph/ceph-mon.$(hostname).asok connection scores dump` for the "connectivity rank".

During the tests we used `watch -n1` to monitor the output of these commands. This also means we may be missing information that changes too quickly.

The issue we encounter happens when the following events occur:

1. All nodes are online and the monitors have monitor and connectivity ranks 0 to 4. Each rank occurs exactly once and the monitor rank of a node matches the connectivity rank of that node. I will refer to these nodes as nodes A to E from now on, so node A has rank 0, B has 1, C has 2, D has 3 and E has 4.

2. One data center with nodes B and C goes offline. We simulate this by running `ip route add blackhole $ip` on all machines in that data center to block all IPs of the nodes in the other data centers (see the sketch after this list). We do this on the monitor and OSD nodes at the "same" time (within about one second).

3. The surviving nodes create a new quorum between nodes A, D and E, so ranks 0, 3 and 4 have survived. Their connectivity ranks change to 0, 1 and 2, but their monitor ranks stay the same. They go to "leader" and "peon" in the monitor status. `ceph status` also shows that stretch mode has detected a data center failure.

3.1. The monitors B and C in the offlined data center retain their monitor and connectivity ranks of 1 and 2, but are stuck in "probing" (I think) or "electing" state in the monitor status. I'm not 100% sure which state they were in, but if that matters I can retest. They were certainly not part of the quorum.

4. We restore the connection of the offline data center by removing the blackhole routes again.

5. The nodes momentarily manage to create a quorum, but it collapses within seconds, and we are not quite sure whether the quorum contained all nodes or just a subset of them.

6. After the quorum collapses, the connectivity rank of nodes B to E changes to 1, while their monitor ranks are still 1 to 4. The monitor and connectivity rank of node A is still 0.

7. All monitors are stuck in the "electing" state in the monitor status for at least several minutes. We've seen this persist for hours before, but back then we hadn't yet analysed it in this detail; after reproducing it now, it stayed like this for at least 5 minutes. During that time, the "epoch" in the monitor status increases by 2 roughly every 5 seconds (wall time), and nothing else happens. We believe that the connectivity strategy is unable to create a quorum because only ranks 0 and 1 appear to be online: ceph does not seem to notice that the connectivity rank and the actual monitor rank of a node differ before the connectivity data is sent to other nodes, so those nodes simply collect all the data from nodes B to E under the data for rank 1. Thus ranks 2 to 4 are seen as offline, but at least three nodes are required to build a quorum since our cluster contains 5 monitor nodes.

8. We restart one node where the connectivity rank and the monitor rank mismatch. In this case we decided to use node D with monitor rank 3 and connectivity rank 1. After the restart it changed its connectivity rank to 3 as well. Now the cluster is able to find 3 nodes again to build a quorum, and after a few seconds all 5 nodes join a working quorum, even though nodes C and E still show connectivity rank 1, together with node B. Node B also shows monitor rank 1, so connectivity rank 1 sounds correct there.
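For completeness, the following sketch shows roughly how we blackhole and later restore the remote data centers in steps 2 and 4. The variable name REMOTE_IPS is just an illustration; in our setup it holds the monitor and OSD addresses of the two other data centers (the monitor addresses below are taken from the monmap at the end of this mail, the OSD host addresses are omitted):

    # Run on every node in the data center that is supposed to "fail" (step 2).
    # Example from the rz06 side: blackhole the monitors in rz05 and rz03;
    # the OSD host addresses of the remote data centers have to be added as well.
    REMOTE_IPS="10.156.107.252 10.156.107.192 10.156.255.1"
    for ip in $REMOTE_IPS; do
        ip route add blackhole "$ip"
    done

    # Step 4: restore connectivity by deleting the blackhole routes again.
    for ip in $REMOTE_IPS; do
        ip route del blackhole "$ip"
    done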
It appears that the connectivity rank can only ever be decreased and never increased again unless the monitor process is restarted. When failovers happen and the quorum shrinks, the rank decreases, and eventually the cluster enters a state where the connectivity ranks are too low to satisfy the quorum requirement.

During brief testing, we were unable to reproduce this issue by simply taking monitors offline without also taking their OSD nodes offline. It may therefore be related to the special stretch mode handling that kicks in when an entire data center fails.

I believe this is not supposed to happen: 1) the cluster should recover completely after the data center comes back online, and 2) the connectivity rank should match the monitor rank of each node.

I hope I've described the problem in sufficient detail for you to reproduce it. If not, please do feel free to reach out.

Florian

PS: Resetting the connectivity scores with `ceph daemon mon.{name} connection scores reset` on a single node does not change the connectivity rank and the problem persists. Restarting the monitor process on that same node resolves the issue as described above (see the sketch at the end of this mail).

PPS: The output of `connection scores dump` is not a proper JSON document, so parsing it with tools like jq drops a lot of information from the reports array, which is actually a weird object instead of an array.

Just in case, our monmap looks like this:

> ceph mon dump
epoch 42
fsid cd30ad06-b025-40f5-9b5e-767087e8a955
last_changed 2022-03-10T17:13:42.783584+0100
created 2022-02-01T18:16:40.188610+0100
min_mon_release 16 (pacific)
election_strategy: 3
stretch_mode_enabled 1
tiebreaker_mon cephtiebreaker-vrz0506c1n1-test-rz03
disallowed_leaders cephtiebreaker-vrz0506c1n1-test-rz03
0: [v2:10.156.107.252:3300/0,v1:10.156.107.252:6789/0] mon.ceph-monc1n1-test-vrz0506; crush_location {datacenter=rz05}
1: [v2:10.156.107.98:3300/0,v1:10.156.107.98:6789/0] mon.ceph-monc1n3-test-vrz0506; crush_location {datacenter=rz06}
2: [v2:10.156.107.30:3300/0,v1:10.156.107.30:6789/0] mon.ceph-monc1n4-test-vrz0506; crush_location {datacenter=rz06}
3: [v2:10.156.107.192:3300/0,v1:10.156.107.192:6789/0] mon.ceph-monc1n2-test-vrz0506; crush_location {datacenter=rz05}
4: [v2:10.156.255.1:3300/0,v1:10.156.255.1:6789/0] mon.cephtiebreaker-vrz0506c1n1-test-rz03; crush_location {datacenter=rz03}
dumped monmap epoch 42
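For reference, this is roughly the workaround sequence from step 8 and the PS above. The scores reset is the command we actually ran; the restart line is a sketch that assumes a plain systemd/package deployment where the monitor id equals the hostname (with cephadm the unit name differs):

    # On the affected monitor host (node D in our test: monitor rank 3,
    # stuck at connectivity rank 1).

    # Resetting the scores alone did NOT change the connectivity rank for us.
    ceph daemon mon.$(hostname) connection scores reset

    # Only restarting the monitor process brought the connectivity rank back
    # in line with the monitor rank and let the cluster form a quorum again.
    systemctl restart ceph-mon@$(hostname)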