I'm managing a Ceph cluster with 1K+ OSDs distributed across 56 hosts. Until now the CRUSH rule in use has been the default replicated rule, but I want to change that in order to implement a rack-level failure domain.

Some facts about the cluster:
* Ceph version: Pacific 16.2.15
* All pools (RBD and CephFS) currently use the default replicated_rule
* All OSD hosts have 25G networking and spinning disks (HDD); MON DBs are on NVMe
* Workload is 24x7x365
* The built-in balancer is disabled, and has been for a long time. Instead, balancing has been done by a cron job executing:
  ceph osd reweight-by-utilization 112 0.05 30

The current plan is to:
* Disable rebalancing and backfilling:
  ceph osd set norebalance; ceph osd set nobackfill
* Add all 7 racks to the crushmap and distribute the hosts (8 in each) using the built-in commands, e.g.:
  ceph osd crush add-bucket rack<#> rack root=default
  ceph osd crush move osd-host<#> rack=rack<#>
* Create the new rack split rule:
  ceph osd crush rule create-replicated rack_split default rack
* Set the rule on all my pools:
  for p in $(ceph osd lspools | cut -d' ' -f 2) ; do echo $p $(ceph osd pool set $p crush_rule rack_split) ; done
* I will probably also be using upmap-remapped.py here.
* Finally, re-enable rebalancing and backfilling:
  ceph osd unset norebalance; ceph osd unset nobackfill

(A consolidated, scripted version of these steps is included at the end of this mail.)

However, I'm concerned about the amount of data that will need to be rebalanced, since the cluster holds multiple PB. I'm looking for a review of / input on my plan, as well as words of advice/experience from anyone who has been in a similar situation.

I've also seen some odd behavior where Pacific (16) seems to do something different from Quincy (17). Testing the plan on a test cluster:
* On Pacific: data is marked as "degraded", not misplaced as I would have expected. I also see the degraded percentage go above 2000% (but that might be another issue).
* On Quincy: data is marked as misplaced, which seems correct.

All experience and/or input will be greatly appreciated.
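
P.S. For reference, here is the whole sequence as I intend to script it. This is only a sketch: the rack1..rack7 / osd-host1..osd-host56 names and the 8-hosts-per-rack assignment are placeholders for my actual naming and layout.

#!/bin/bash
# Sketch of the planned CRUSH change. Assumes (placeholder) host names
# osd-host1..osd-host56 going into rack1..rack7, 8 hosts per rack.
set -euo pipefail

# 1. Stop data movement while the topology is being changed
ceph osd set norebalance
ceph osd set nobackfill

# 2. Create the rack buckets under the default root and move the hosts into them
for r in $(seq 1 7); do
    ceph osd crush add-bucket "rack${r}" rack root=default
done
for h in $(seq 1 56); do
    r=$(( (h - 1) / 8 + 1 ))   # hosts 1-8 -> rack1, 9-16 -> rack2, ...
    ceph osd crush move "osd-host${h}" "rack=rack${r}"
done

# 3. New replicated rule with rack as the failure domain
ceph osd crush rule create-replicated rack_split default rack

# 4. Point every pool at the new rule
for p in $(ceph osd lspools | cut -d' ' -f 2); do
    echo "$p $(ceph osd pool set "$p" crush_rule rack_split)"
done

# 5. Pin the resulting remapped PGs back with upmap entries so the
#    backfill can be released gradually (upmap-remapped.py, run separately):
# ./upmap-remapped.py | sh

# 6. Re-enable data movement
ceph osd unset nobackfill
ceph osd unset norebalance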
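
To get a rough feel for how many PGs would actually move before touching the live cluster, I'm also considering an offline dry run against a copy of the osdmap, along these lines (untested; the hand-edited crushmap, the /tmp paths and the <poolid> are placeholders):

# Grab the current osdmap and extract its crushmap
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --export-crush /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

# Hand-edit /tmp/crushmap.txt (add rack buckets, move hosts, add rack_split rule),
# then recompile and inject it into a copy of the osdmap
crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
cp /tmp/osdmap /tmp/osdmap.new
osdmaptool /tmp/osdmap.new --import-crush /tmp/crushmap.new

# Dump PG-to-OSD mappings per pool before and after, and count changed lines
osdmaptool /tmp/osdmap --test-map-pgs-dump --pool <poolid> > /tmp/pgs.before
osdmaptool /tmp/osdmap.new --test-map-pgs-dump --pool <poolid> > /tmp/pgs.after
diff /tmp/pgs.before /tmp/pgs.after | grep -c '^>'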