Re: Minimum number of nodes needed for stretch mode?

On Thu, Mar 7, 2024 at 9:09 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi,
>
> TL;DR
>
> Failure domain considered is data center. Cluster in stretch mode [1].
>
> - What is the minimum number of monitor nodes (apart from the
> tiebreaker) needed per failure domain?

You need at least two monitors per site. This is because in stretch
mode, OSDs can *only* connect to monitors in their own data center, and
you don't want a single restarting monitor to push the entire cluster
into degraded stretch mode.
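
For reference, a minimal layout along those lines is two monitors per
data center plus the tiebreaker, five in total. A rough sketch based on
the stretch mode docs [1]; monitor names a-e and the dc1/dc2/dc3
buckets are just placeholders:

  ceph mon set election_strategy connectivity
  ceph mon set_location a datacenter=dc1
  ceph mon set_location b datacenter=dc1
  ceph mon set_location c datacenter=dc2
  ceph mon set_location d datacenter=dc2
  # the tiebreaker should live in a third location
  ceph mon set_location e datacenter=dc3
  ceph mon enable_stretch_mode e stretch_rule datacenter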

>
> - What is the minimum number of storage nodes needed per failure domain?

I don't think there's anything specific here beyond supporting 2
replicas per site (so, at least two OSD hosts per site unless you
configure things weirdly).
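
For completeness, getting those hosts under the right data center
buckets is just the usual CRUSH moves. A sketch with two OSD hosts per
data center (host1..host4 and dc1/dc2 are placeholders):

  ceph osd crush add-bucket dc1 datacenter
  ceph osd crush add-bucket dc2 datacenter
  ceph osd crush move dc1 root=default
  ceph osd crush move dc2 root=default
  ceph osd crush move host1 datacenter=dc1
  ceph osd crush move host2 datacenter=dc1
  ceph osd crush move host3 datacenter=dc2
  ceph osd crush move host4 datacenter=dc2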

> - Are device classes supported with stretch mode?

Stretch mode doesn't have any specific logic around device classes, so
if you provide CRUSH rules that work with them it should be okay? But
that combination is definitely not well tested.

> - Is min_size = 1 in "degraded stretch mode" a hard-coded requirement,
> or can this be changed to leave min_size = 2? (Yes, I'm aware that no
> other OSD may go down in the surviving data center or PGs will become
> unavailable.)

Hmm, I'm pretty sure this is hard-coded, but I haven't checked in a
while. Hopefully somebody else can chime in here.
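
If you want to experiment, the pool setting itself is easy to inspect
and change (the pool name is a placeholder); whether the stretch mode
monitor logic clamps it back to 1 once degraded stretch mode kicks in
is exactly the part I'm not sure about:

  ceph osd pool get mypool min_size
  ceph osd pool set mypool min_size 2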

> I've converted a (test) 3-node replicated cluster (2 storage nodes, 1
> node with monitor only, min_size=2, size=4) to a "stretch mode" setup
> [1]. That works as expected.
>
> CRUSH rule (adjusted to work with only 1 host and 2 OSDs per device
> class per data center):
>
> rule stretch_rule {
>         id 5
>         type replicated
>         step take dc1
>         step choose firstn 0 type host
>         step chooseleaf firstn 2 type osd
>         step emit
>         step take dc2
>         step choose firstn 0 type host
>         step chooseleaf firstn 2 type osd
>         step emit
> }
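
(As an aside: a rule like this can be dry-run against the compiled
CRUSH map before pointing any pools at it. Rule id 5 and 4 replicas
match the setup above; the file name is arbitrary:

  ceph osd getcrushmap -o crush.bin
  crushtool -i crush.bin --test --rule 5 --num-rep 4 --show-mappings

This prints which OSDs each test input maps to, which makes it easy to
confirm you get two OSDs per data center.)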
>
> The documentation seems to suggest that 2 storage nodes and 2 monitor
> nodes are needed at a minimum. Is that correct? I wonder why? For as
> minimal a cluster as possible I don't see the need for an additional
> monitor per data center. Does the tiebreaker monitor function as a
> normal monitor (apart from not being allowed to become leader)?
>
> When stretch rules with device classes are used, things no longer work
> as expected. Example CRUSH rule:
>
>
> rule stretch_rule_ssd {
>         id 4
>         type replicated
>         step take dc1 class ssd
>         step choose firstn 0 type host
>         step chooseleaf firstn 2 type osd
>         step emit
>         step take dc2 class ssd
>         step choose firstn 0 type host
>         step chooseleaf firstn 2 type osd
>         step emit
> }
>
> A similar CRUSH rule for hdd exists. When I change the crush_rule for
> one of the pools to use stretch_rule_ssd, the PGs on OSDs with device
> class ssd become inactive as soon as one of the data centers goes
> offline (and "degraded stretch mode" has been activated, and only 1
> bucket, a data center, is needed for peering). I don't understand why.
> Another issue with this is that as soon as the data center is online
> again, the recovery will never finish by itself and a "ceph osd
> force_healthy_stretch_mode --yes-i-really-mean-it" is needed to get
> back to HEALTH_OK.

Hmm. My first guess is that there's something going on with the shadow
CRUSH trees created by device classes, but somebody would need to dig
into that. I think you should file a ticket.
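
One thing that might help narrow it down before filing: device classes
are implemented as shadow buckets in the CRUSH map (dc1~ssd, dc2~ssd,
and so on), so it's worth checking what the rule actually maps to. You
can inspect the shadow hierarchy and dry-run the ssd rule the same way
as the plain rule above (rule id 4 here):

  ceph osd crush tree --show-shadow
  crushtool -i crush.bin --test --rule 4 --num-rep 4 --show-mappings
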
-Greg

>
> Can anyone explain to me why this is?
>
> Gr. Stefan
>
> [1]: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



