Re: Minimum amount of nodes needed for stretch mode?



On 07-03-2024 18:16, Gregory Farnum wrote:
On Thu, Mar 7, 2024 at 9:09 AM Stefan Kooman <stefan@xxxxxx> wrote:



Failure domain considered is data center. Cluster in stretch mode [1].

- What is the minimum number of monitor nodes (apart from the tiebreaker)
needed per failure domain?

You need at least two monitors per site. This is because in stretch
mode, OSDs can *only* connect to monitors in their data center. You
don't want a restarting monitor to lead to the entire cluster
switching to degraded stretch mode.

It does not seem to work like that. It's true that OSDs within a data center cannot peer, or be restarted, while the monitor there is unavailable, but a monitor can be offline without causing cluster-wide unavailability. When I stop a monitor (it does not matter which one) the cluster stays online. If one of the monitors in a data center is offline, OSDs in that data center that get restarted won't be able to come online anymore (as expected). Only when the whole data center goes offline is stretch mode triggered. The docs are pretty clear about this, and it's exactly what I see in practice:

"If all OSDs and monitors in one of the data centers become inaccessible at once, the surviving data center enters a “degraded stretch mode”."

It does not have to be "all at once", though: degraded stretch mode is triggered when the last remaining daemon in a failure domain goes offline. Take all OSDs offline but keep the monitor(s) alive? Then you get 100% inactive PGs, because the "peer from two buckets" enforcement rule is still active (as designed).
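For anyone reproducing this, the transitions described above can be observed from the cluster and osdmap state. A minimal sketch (the exact stretch-related field names in the osdmap may differ between Ceph releases):

```shell
# Overall cluster state; degraded stretch mode shows up as a health warning
ceph -s

# The osdmap carries the stretch state; look for fields such as
# stretch_mode_enabled and degraded_stretch_mode
ceph osd dump | grep -i stretch

# Watch PG activity while taking a failure domain down: with only the
# OSDs of one DC stopped (monitor still up), PGs go inactive because
# the "peer from two buckets" rule is still enforced
ceph pg stat
```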

- What is the minimum number of storage nodes needed per failure domain?

I don't think there's anything specific here beyond supporting 2
replicas per site (so, at least two OSD hosts unless you configure
things weirdly).

lol. I personally don't think it's weird: it's the counterpart of the minimal replicated 3-node cluster.

- Are device classes supported with stretch mode?

Stretch mode doesn't have any specific logic around device classes, so
if you provide CRUSH rules that work with them it should be okay? But
definitely not tested.

It does not seem to work. But I'll repeat the process once more and double check.

- Is min_size = 1 in "degraded stretch mode" a hard-coded requirement, or
can this be changed to leave min_size = 2? (Yes, I'm aware that no
other OSD may go down in the surviving data center or PGs will become
inactive.)
Hmm, I'm pretty sure this is hard coded but haven't checked in a
while. Hopefully somebody else can chime in here.

I would like this to be configurable, to be honest, as it's a policy decision rather than a technical requirement. Non-stretch-mode clusters also have this freedom of choice. If the operator values data consistency over data availability, that should be honored.

I've converted a (test) 3 node replicated cluster (2 storage nodes, 1
node with monitor only, min_size=2, size=4) setup to a "stretch mode"
setup [1]. That works as expected.
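For completeness, the conversion followed the documented stretch mode procedure. A sketch of the commands involved (the monitor names a/b/c and bucket names dc1/dc2/dc3 are placeholders for this test setup):

```shell
# Switch the monitors to the connectivity election strategy
ceph mon set election_strategy connectivity

# Tell each monitor which failure domain it lives in
ceph mon set_location a datacenter=dc1
ceph mon set_location b datacenter=dc2
ceph mon set_location c datacenter=dc3   # tiebreaker site

# Enable stretch mode: tiebreaker monitor, CRUSH rule, dividing bucket type
ceph mon enable_stretch_mode c stretch_rule datacenter
```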

CRUSH rule (adjusted to work with 1 host and 2 OSDs per device class per
data center only)

rule stretch_rule {
         id 5
         type replicated
         step take dc1
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
         step take dc2
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
}
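In case it helps others reproduce this, the rule can be injected and dry-run with crushtool before activating it (rule id 5 and the replica count match the setup described above; file names are arbitrary):

```shell
# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# ... add the stretch_rule shown above to crushmap.txt, then recompile
crushtool -c crushmap.txt -o crushmap-new.bin

# Dry-run: show which OSDs rule 5 would pick for size=4 pools
crushtool -i crushmap-new.bin --test --rule 5 --num-rep 4 --show-mappings

# Activate only after the mappings look sane
ceph osd setcrushmap -i crushmap-new.bin
```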

The documentation seems to suggest that 2 storage nodes and 2 monitor
nodes are needed at a minimum. Is that correct? I wonder why? For a
minimal (as possible) cluster I don't see the need for one additional
monitor per data center. Does the tiebreaker monitor function as a normal
monitor (apart from it not being allowed to become leader)?

When stretch rules with device classes are used things don't work as
expected anymore. Example crush rule:

rule stretch_rule_ssd {
         id 4
         type replicated
         step take dc1 class ssd
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
         step take dc2 class ssd
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
}

A similar CRUSH rule for hdd exists. When I change the crush_rule for
one of the pools to use stretch_rule_ssd, the PGs on OSDs with device
class ssd become inactive as soon as one of the data centers goes
offline (and "degraded stretch mode" has been activated, so only 1
bucket, i.e. data center, is needed for peering). I don't understand why.
Another issue with this: as soon as the data center is online
again, recovery never finishes by itself, and a "ceph osd
force_healthy_stretch_mode --yes-i-really-mean-it" is needed to get
the cluster healthy again.

Hmm. My first guess is there's something going on with the shadow
CRUSH tree from device classes, but somebody would need to dig into
that. I think you should file a ticket.
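A starting point for that digging, in case it's useful: the shadow trees that device classes create can be inspected directly, and the ssd rule can be dry-run with a data center's OSDs weighted out to mimic the failure (rule id 4 as in the example above; the OSD ids in the --weight arguments are placeholders):

```shell
# Show the per-device-class shadow buckets (e.g. dc1~ssd) next to the real tree
ceph osd crush tree --show-shadow

# Dry-run the ssd rule against the current map
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 4 --num-rep 4 --show-mappings

# Same, but with the failed DC's OSDs weighted out to simulate the outage
crushtool -i crushmap.bin --test --rule 4 --num-rep 4 --show-mappings \
    --weight 0 0 --weight 1 0
```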

I will do some more testing and file a ticket if I'm confident it should work but doesn't.


Gr. Stefan
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
