Hi folks,
I have a small cluster of three Ceph hosts running Pacific. I'm trying
to balance resilience against disk usage, so I've set up a k=4 m=2
erasure-coded pool for some bulk storage on HDD devices.
With correct placement of PGs, this should allow me to take any one host
offline for maintenance: losing a host costs at most two of the six
shards, leaving the k=4 needed to reconstruct the data. I've written this
CRUSH rule for that purpose:
rule erasure_k4_m2_hdd_rule {
    id 3
    type erasure
    step take default class hdd
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}
This should pick three hosts, then two OSDs from each, which at least
ensures that no single host holds more than two of any PG's six shards.
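
For reference, this is roughly the workflow I used to compile, test, and
inject the rule (the file names are just whatever I happened to use):

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # ...add the rule above to crushmap.txt, then recompile...
    crushtool -c crushmap.txt -o crushmap.new

    # sanity-check the mappings the rule produces before injecting it
    crushtool -i crushmap.new --test --rule 3 --num-rep 6 --show-mappings | head

    ceph osd setcrushmap -i crushmap.new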
This appears to work correctly, but I'm running into an odd situation
when adding OSDs to the cluster: sometimes the chosen hosts swap
positions in a PG's up set, resulting in unnecessary remapping work.
For example, I have one PG that changed from OSDs [0,13,7,9,3,5] to
[0,13,3,5,7,9]. (Note that the middle pair and the last pair of OSDs have
swapped with one another.) From a quick perusal of other PGs being moved,
the two OSDs within a host never appear to be rearranged, but the order
of the chosen hosts may be shuffled.
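
In case the method matters, I've just been reading the sets out of
"ceph pg map" and looking for moving PGs with "ceph pg dump pgs_brief"
(the pg ID below is only an example):

    # up and acting sets for a single PG
    ceph pg map 4.1f

    # list PGs that are currently remapped
    ceph pg dump pgs_brief | grep remapped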
Is there something I'm missing that would make this rule more stable in
the face of OSD additions? (I'm wondering whether the host-choosing step
should be "firstn" rather than "indep", even though the discussion at
https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#crushmaprules
implies indep is preferable for EC pools.)
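
For concreteness, the variant I was contemplating would just swap the
host step, with everything else as above (untested):

    rule erasure_k4_m2_hdd_rule {
        id 3
        type erasure
        step take default class hdd
        step choose firstn 3 type host
        step chooseleaf indep 2 type osd
        step emit
    }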
I don't have current plans to expand beyond a three-host cluster, but if
there's an alternative way to express "no more than two OSDs per host per
PG", that could be helpful as well.
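
For what it's worth, the effect can be explored offline with crushtool
alone, along these lines (the OSD id, weight, and host name below are
placeholders for whatever is being added):

    # mappings produced by the current map
    crushtool -i crushmap.bin --test --rule 3 --num-rep 6 --show-mappings > before.txt

    # simulate adding one OSD under an existing host
    crushtool -i crushmap.bin --add-item 14 1.0 osd.14 \
        --loc host node-c --loc root default -o crushmap.plus-one
    crushtool -i crushmap.plus-one --test --rule 3 --num-rep 6 --show-mappings > after.txt

    # PGs whose mapping changed
    diff before.txt after.txt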
Any insights or suggestions would be appreciated.
Thanks,
aschmitz