Re: Minimum amount of nodes needed for stretch mode?

Stefan Kooman <stefan@xxxxxx> · Thu, 7 Mar 2024 20:09:01 +0100

On 07-03-2024 18:16, Gregory Farnum wrote:
On Thu, Mar 7, 2024 at 9:09 AM Stefan Kooman <stefan@xxxxxx> wrote:

Hi,

TL;DR

Failure domain considered is data center. Cluster in stretch mode [1].

- What is the minimum number of monitor nodes (apart from tie breaker)
needed per failure domain?

You need at least two monitors per site. This is because in stretch
mode, OSDs can *only* connect to monitors in their data center. You
don't want a restarting monitor to lead to the entire cluster
switching to degraded stretch mode.

It does not seem to work like that. It's true that OSDs within a data
center cannot peer or be restarted while the monitor there is
unavailable, but that monitor can be offline without making the cluster
unavailable. When I stop a monitor (it does not matter which one), the
cluster stays online. If one of the monitors in a data center is
offline, OSDs in that data center that get restarted won't be able to
come online anymore (as expected). Only when the whole data center is
offline is degraded stretch mode triggered. The docs are pretty clear
about this, and it's exactly what I see in practice:

"If all OSDs and monitors in one of the data centers become inaccessible 
at once, the surviving data center enters a “degraded stretch mode”."

It does not have to be "all at once", though. When the last remaining
daemon in a failure domain goes offline, degraded stretch mode gets
triggered. If I take all OSDs offline but keep the monitor(s) alive, I
get 100% inactive PGs, as the "peer from two buckets" enforcement rule
is still active (as designed).
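
The behaviour described above can be written down as a toy model. This
is only a sketch of what I observed, in plain Python, not Ceph code: the
names Site and cluster_state are made up for illustration, and it
assumes two data centers plus a tiebreaker monitor that is always up.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Daemons still running in one data center (hypothetical model)."""
    mons_up: int
    osds_up: int


def site_is_offline(site: Site) -> bool:
    # A site counts as offline only when its last daemon is gone.
    return site.mons_up == 0 and site.osds_up == 0


def cluster_state(dc1: Site, dc2: Site) -> str:
    offline = [site_is_offline(dc1), site_is_offline(dc2)]
    if all(offline):
        return "down"
    if any(offline):
        # Whole data center gone: the survivor may peer from one bucket.
        return "degraded stretch mode"
    if dc1.osds_up == 0 or dc2.osds_up == 0:
        # Monitors alive but no OSDs in one site: the "peer from two
        # buckets" rule is still enforced, so PGs cannot go active.
        return "PGs inactive"
    # Includes the case of a stopped monitor with its OSDs still running.
    return "PGs active"


print(cluster_state(Site(mons_up=0, osds_up=2), Site(mons_up=1, osds_up=2)))
print(cluster_state(Site(mons_up=1, osds_up=0), Site(mons_up=1, osds_up=2)))
print(cluster_state(Site(mons_up=0, osds_up=0), Site(mons_up=1, osds_up=2)))
```

The three calls correspond to a stopped monitor, all OSDs of one site
stopped with the monitor kept alive, and a whole site offline.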

- What is the minimum number of storage nodes needed per failure domain?

I don't think there's anything specific here beyond supporting 2
replicas per site (so, at least two OSD hosts unless you configure
things weirdly).

lol. I personally don't think it's weird: it's the counterpart of the
minimal replicated 3-node cluster.

- Are device classes supported with stretch mode?

Stretch mode doesn't have any specific logic around device classes, so
if you provide CRUSH rules that work with them it should be okay? But
definitely not tested.

It does not seem to work. But I'll repeat the process once more and 
double check.

- Is min_size = 1 in "degraded stretch mode" a hard-coded requirement, or
can this be changed to leave min_size = 2 (yes, I'm aware that no
other OSD may go down in the surviving data center or PGs will become
unavailable)?

Hmm, I'm pretty sure this is hard coded but haven't checked in a
while. Hopefully somebody else can chime in here.

To be honest, I would like this to be configurable, as it's a policy
decision and not a technical requirement. Non-stretch-mode clusters also
have this freedom of choice. If the operator values data consistency
over data availability, that should be honored.
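
To make the trade-off concrete, here is a toy calculation in plain
Python (not Ceph code; the function names are made up). It assumes
size=4 with two replicas per data center, so the surviving site holds
two replicas, and it uses the usual rule that a PG accepts I/O only
while the number of available replicas is at least min_size.

```python
def pg_is_active(replicas_up: int, min_size: int) -> bool:
    # A PG serves I/O only with at least min_size replicas available.
    return replicas_up >= min_size


def survivable_osd_failures(replicas_in_surviving_site: int, min_size: int) -> int:
    # How many more replicas can be lost before the PG goes inactive.
    return max(replicas_in_surviving_site - min_size, 0)


for min_size in (1, 2):
    extra = survivable_osd_failures(2, min_size)
    after_one_more_failure = pg_is_active(2 - 1, min_size)
    print(f"min_size={min_size}: survives {extra} more OSD failure(s), "
          f"active after one more failure: {after_one_more_failure}")
```

With min_size=1 the PG stays available on a single remaining copy;
with min_size=2 it goes inactive instead of accepting writes on one copy.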

I've converted a (test) 3-node replicated cluster (2 storage nodes, 1
node with monitor only, min_size=2, size=4) to a "stretch mode"
setup [1]. That works as expected.

CRUSH rule (adjusted to work with 1 host and 2 OSDs per device class per
data center only)

rule stretch_rule {
         id 5
         type replicated
         step take dc1
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
         step take dc2
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
}
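
Why this rule only fits one host per data center can be seen with a
rough count of how many OSDs each take/emit block contributes. This is
a simplified sketch in plain Python, not the real CRUSH algorithm (no
weights, retries or failed OSDs), and osds_selected is a made-up helper.
It assumes that "firstn 0" means "up to the pool size" and that the
combined result is cut off at the pool size.

```python
def osds_selected(sites, pool_size):
    """Count OSDs emitted per 'take ... emit' block.

    sites: list of (hosts, osds_per_host) tuples, one per data center.
    """
    emitted_per_site = []
    remaining = pool_size
    for hosts, osds_per_host in sites:
        # step choose firstn 0 type host: 0 expands to the pool size,
        # limited by the hosts that actually exist in this data center.
        chosen_hosts = min(pool_size, hosts)
        # step chooseleaf firstn 2 type osd: up to 2 OSDs under each host.
        picked = chosen_hosts * min(2, osds_per_host)
        # step emit: the overall result never exceeds the pool size.
        emitted = min(picked, remaining)
        emitted_per_site.append(emitted)
        remaining -= emitted
    return emitted_per_site


print(osds_selected([(1, 2), (1, 2)], pool_size=4))  # 1 host per DC
print(osds_selected([(2, 2), (2, 2)], pool_size=4))  # 2 hosts per DC
```

In this model one host per data center gives two replicas per site,
while two hosts per data center would let dc1 fill all four slots.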

The documentation seems to suggest that 2 storage nodes and 2 monitor
nodes are needed at a minimum. Is that correct? I wonder why? For a
cluster that is as minimal as possible, I don't see the need for one
additional monitor per data center. Does the tiebreaker monitor function
as a normal monitor (apart from not being allowed to become leader)?

When stretch rules with device classes are used things don't work as
expected anymore. Example crush rule:

rule stretch_rule_ssd {
         id 4
         type replicated
         step take dc1 class ssd
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
         step take dc2 class ssd
         step choose firstn 0 type host
         step chooseleaf firstn 2 type osd
         step emit
}

A similar crush rule for hdd exists. When I change the crush_rule for
one of the pools to use stretch_rule_ssd, the PGs on OSDs with device
class ssd become inactive as soon as one of the data centers goes
offline (even though "degraded stretch mode" has been activated, and
only 1 bucket, i.e. data center, is needed for peering). I don't
understand why. Another issue is that as soon as the data center is
online again, the recovery never finishes by itself, and a "ceph osd
force_healthy_stretch_mode --yes-i-really-mean-it" is needed to get
back to HEALTH_OK.

Hmm. My first guess is there's something going on with the shadow
CRUSH tree from device classes, but somebody would need to dig into
that. I think you should file a ticket.

I will do some more testing and file a ticket if I'm confident it should 
work but doesn't.

Thanks!

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx