Re: Setting temporary CRUSH "constraint" for planned cross-datacenter downtime

Hi Niklas,

To explain the 33% misplaced objects after you move a host to another DC, one would have to check the current crush rule (ceph osd getcrushmap | crushtool -d -) and which OSDs the PGs are mapped to before and after the move (ceph pg dump).
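
A minimal way to capture that before/after state could look like this (file names are just examples):

ceph osd getcrushmap -o crushmap.bin        # grab the binary crush map
crushtool -d crushmap.bin -o crushmap.txt   # decompile it for review

ceph pg dump pgs_brief > pg-mapping-before.txt   # PG -> OSD mappings before the move
# ... move the host, then dump again and compare
ceph pg dump pgs_brief > pg-mapping-after.txt
diff pg-mapping-before.txt pg-mapping-after.txt | less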

Regarding the replicated crush rule that Ceph creates by default, the rule places each replica on a different host under the 'default' root (failure domain being host).

# rules
rule replicated_rule {
	id 0
	type replicated
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

If you're using size 3 and min_size 2 and want your cluster to keep serving IOs while those 2 datacenters are down, you need to make sure that these 2 datacenters together host only 1 replica.
You could group these 2 datacenters into one zone and all other datacenters into another zone, then place 1 replica in the first zone and the 2 other replicas in two datacenters of the other zone.

For example:

root default
    region FSN
        zone FSN1
            datacenter FSN1-DC1
                host machine-1
                    osd.0
                    ... 10 OSDs per datacenter
                ... currently 1 machine per datacenter
            datacenter FSN1-DC2
                host machine-2
                    ...
        zone FSN2        
            ... other 8 datacenters
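
To build a hierarchy like the one above, you could create the second zone and move the other datacenters under it. A sketch, with bucket names taken from the example (FSN1-DC3 is just a guess at your naming scheme, adjust to your actual tree):

ceph osd crush add-bucket FSN2 zone         # create the new zone bucket
ceph osd crush move FSN2 region=FSN         # attach it under the region
ceph osd crush move FSN1-DC3 zone=FSN2      # move a datacenter under the new zone
# ... repeat the move for the remaining datacenters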

Then create a new rule like the one below:

# rules
rule replicated_zone {
	id 1
	type replicated
	step take FSN1
	step chooseleaf firstn 1 type datacenter
	step emit
	step take FSN2
	step chooseleaf firstn 2 type datacenter
	step emit
}
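
One way to get such a rule into the cluster is to edit the decompiled crush map and inject it back. A sketch (file names are just examples):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add the replicated_zone rule to the "# rules" section of crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new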

Then you'd just have to change the crush rule of the pool(s) and wait for the data movement.

ceph osd pool set rbd_zone crush_rule replicated_zone
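
If you want to watch the resulting data movement, the standard commands are enough, e.g.:

ceph -s        # recovery/backfill summary
ceph pg stat   # count of remapped/backfilling PGs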

Note that you can use crushtool [1] to simulate PG mapping and check the new crush rule before applying it to any pools.
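
For example, against the compiled map from the sketch above (rule id 1 and 3 replicas, as in the rule here):

crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-mappings
crushtool -i crushmap.new --test --rule 1 --num-rep 3 --show-utilization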

Regards,
Frédéric.

[1] https://docs.ceph.com/en/reef/man/8/crushtool/

----- On 4 Nov 24, at 17:01, Niklas Hambüchen mail@xxxxxx wrote:

> Hi Joachim,
> 
> I'm currently looking for the general methodology and if it's possible without
> rebalancing everything.
> 
> But of course I'd also appreciate tips directly for my deployment; here is the
> info:
> 
> Ceph 18, Simple 3-replication (osd_pool_default_size = 3, default CRUSH rules
> Ceph creates for that).
> 
> Failure domains from `ceph osd tree`:
> 
> root default
>    region FSN
>        zone FSN1
>            datacenter FSN1-DC1
>                host machine-1
>                    osd.0
>                    ... 10 OSDs per datacenter
>                ... currently 1 machine per datacenter
>            datacenter FSN1-DC2
>                host machine-2
>                    ...
>            ... currently 8 datacenters
> 
> I already tried simply
> 
>    ceph osd crush move machine-1 datacenter=FSN1-DC2
> 
> to "simulate" that DC1 and DC2 are temporarily the same failure domain
> (machine-1 is the only machine in DC1 currently), but that immediately causes
> 33% of objects to be misplaced -- much more movement than I'd hope for and more
> than would be needed (I'd expect 12.5% would need to be moved given that 1 out
> of 8 DCs needs to be moved).
> 
> Thanks!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



