Hi,
I only have one remark, on your assumption about maintenance with
your current setup. With your profile k=4 m=2 you'd have a min_size
of 5 (k + 1, which is the recommended value), so taking one host down
would still result in an IO pause because min_size is not met. To
allow IO you'd need to reduce min_size to 4, which is only
recommended in disaster scenarios.
With three nodes you'd be better off with replication size 3, although
it requires more storage, of course.
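In case it helps, checking and (temporarily) lowering min_size would
look roughly like this; "ec-bulk" is just a placeholder pool name:

  ceph osd pool get ec-bulk min_size     # show the current value (5 for k=4 m=2)
  ceph osd pool set ec-bulk min_size 4   # maintenance/disaster exception only
  ceph osd pool set ec-bulk min_size 5   # revert once all hosts are back up

But as said, I'd only do that as a short-lived exception and set it
back right away.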
Adding (or removing) OSDs always results in some remapping; I don't
think what you're describing is unexpected.
Regards,
Eugen
Quoting aschmitz <ceph-users@xxxxxxxxxxxx>:
Hi folks,
I have a small cluster of three Ceph hosts running on Pacific. I'm
trying to balance resilience and disk usage, so I've set up a k=4
m=2 pool for some bulk storage on HDD devices.
With the correct placement of PGs this should allow me to take any
one host offline for maintenance. I've written this CRUSH rule for
that purpose:
rule erasure_k4_m2_hdd_rule {
    id 3
    type erasure
    step take default class hdd
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}
This should pick three hosts, and then two OSDs from each, which at
least ensures that no single host holds more than two of a PG's OSDs.
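For reference, a matching profile and pool would be created with
something like the following (the profile and pool names here are
just placeholders):

  ceph osd erasure-code-profile set ec42-hdd k=4 m=2 crush-device-class=hdd
  ceph osd pool create bulk-ec 128 128 erasure ec42-hdd erasure_k4_m2_hdd_rule

Passing the rule name makes the pool use the custom rule above rather
than the rule the profile would otherwise generate.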
This appears to work correctly, but I'm running into an odd
situation when adding additional OSDs to the cluster: sometimes the
hosts flip order in a PG's set, resulting in unnecessary remapping
work.
For example, I have one PG that changed from OSDs [0,13,7,9,3,5] to
[0,13,3,5,7,9]. (Note that the middle pair and the last pair of OSDs
have swapped with one another.) From a quick perusal of other PGs
that are being moved, the two OSDs within a host never appear to be
rearranged, but the set of hosts that are chosen may be shuffled.
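For what it's worth, the placements can also be inspected offline by
dumping the CRUSH map and replaying the rule with crushtool, roughly
like this (rule id 3 is the rule above, and --num-rep 6 is k+m):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 3 --num-rep 6 --show-mappings

Comparing that output between the old and new CRUSH maps should show
how the placements shift without touching the cluster.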
Is there something I'm missing that would make this rule more stable
in the face of OSD addition? (I'm wondering if the host-choosing
step should be "firstn" rather than "indep", even though the
discussion at
https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#crushmaprules implies indep is preferable in EC
pools.)
I don't have current plans to expand beyond a three-host cluster,
but if there's an alternative way to express "not more than two OSDs
per host", that could be helpful as well.
Any insights or suggestions would be appreciated.
Thanks,
aschmitz
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx