Re: Stable erasure coding CRUSH rule for multiple hosts?

Hi,

I only have one remark, on your assumption about maintenance with your current setup. With your profile (k=4, m=2) you'd have a min_size of 5 (k + 1, which is the recommended value). Taking one host down removes two of the six shards from each PG, leaving four, so IO would still pause because min_size is not met. To allow IO you'd need to reduce min_size to 4, which is only recommended in disaster scenarios. With three nodes you'd be better off with a replicated pool of size 3, although it requires more storage, of course.

Adding (or removing) OSDs always results in remapping, so I don't think what you're describing is unexpected.
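The min_size remark above comes down to simple arithmetic. A minimal Python sketch of it follows, assuming the six shards are spread evenly at two per host, as the rule quoted below intends. The function name is made up for illustration and is not part of any Ceph tooling.

```python
def shards_left_after_host_loss(k: int, m: int, hosts: int, hosts_down: int = 1) -> int:
    """Shards of one PG still available when `hosts_down` hosts are offline.

    Assumption: the k + m shards are spread evenly across `hosts`.
    """
    total = k + m
    if total % hosts != 0:
        raise ValueError("this sketch assumes an even spread of shards per host")
    per_host = total // hosts
    return total - per_host * hosts_down


k, m, hosts = 4, 2, 3
min_size = k + 1  # the recommended value mentioned above
remaining = shards_left_after_host_loss(k, m, hosts)
io_allowed = remaining >= min_size

print(f"remaining={remaining} min_size={min_size} io_allowed={io_allowed}")
# prints: remaining=4 min_size=5 io_allowed=False
```

With min_size lowered to 4 the comparison would pass, but the PG would then be serving IO with no redundancy left.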

Regards,
Eugen

Quoting aschmitz <ceph-users@xxxxxxxxxxxx>:

Hi folks,

I have a small cluster of three Ceph hosts running on Pacific. I'm trying to balance resilience and disk usage, so I've set up a k=4 m=2 pool for some bulk storage on HDD devices.

With the correct placement of PGs this should allow me to take any one host offline for maintenance. I've written this CRUSH rule for that purpose:

rule erasure_k4_m2_hdd_rule {
  id 3
  type erasure
  step take default class hdd
  step choose indep 3 type host
  step chooseleaf indep 2 type osd
  step emit
}

This should pick three hosts, and then two OSDs from each, which at least ensures that no host holds more than two of a PG's six shards.
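The "no more than two per host" property can be checked against any mapping the rule produces. Below is a small Python sketch of such a check. The OSD-to-host layout in it is invented for illustration (the real layout is not given in this thread), and the helper names are hypothetical rather than anything Ceph provides.

```python
from collections import Counter

# Hypothetical OSD-to-host layout, invented for illustration only.
OSD_HOST = {
    0: "host-a", 13: "host-a",
    7: "host-b", 9: "host-b",
    3: "host-c", 5: "host-c",
}


def shards_per_host(osds, osd_host):
    """Count how many OSDs of one PG's set live on each host."""
    return Counter(osd_host[osd] for osd in osds)


def within_host_limit(osds, osd_host, limit=2):
    """True if no host holds more than `limit` OSDs of the set."""
    return max(shards_per_host(osds, osd_host).values()) <= limit


pg_set = [0, 13, 7, 9, 3, 5]
print(dict(shards_per_host(pg_set, OSD_HOST)))
print(within_host_limit(pg_set, OSD_HOST))
```

Feeding it the mappings reported by the cluster, together with the real OSD-to-host layout, would show whether any PG ends up with three or more shards on one host.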

This appears to work correctly, but I'm running into an odd situation when adding additional OSDs to the cluster: sometimes the hosts flip order in a PG's set, resulting in unnecessary remapping work.

For example, I have one PG that changed from OSDs [0,13,7,9,3,5] to [0,13,3,5,7,9]. (Note that the middle two and last two sets of OSDs have swapped with one another.) From a quick perusal of other PGs that are being moved, the two OSDs within a host never appear to be rearranged, but the set of hosts that are chosen may be shuffled.
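To make the swap in that example explicit, here is a short Python sketch comparing the two sets position by position. It assumes that consecutive pairs in the set belong to the same host, which is how the rule above is meant to lay them out. The helper names are illustrative only.

```python
old = [0, 13, 7, 9, 3, 5]
new = [0, 13, 3, 5, 7, 9]


def moved_positions(before, after):
    """Indexes whose OSD differs between the two sets."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


def host_pairs(osds, per_host=2):
    """Group the set into consecutive per-host chunks (assumed layout)."""
    return [tuple(osds[i:i + per_host]) for i in range(0, len(osds), per_host)]


print(sorted(old) == sorted(new))   # True: the same six OSDs are used
print(moved_positions(old, new))    # [2, 3, 4, 5]
print(host_pairs(old))              # [(0, 13), (7, 9), (3, 5)]
print(host_pairs(new))              # [(0, 13), (3, 5), (7, 9)]
```

The same six OSDs are selected both times, but four of the six positions hold a different OSD. Since each position in an erasure-coded PG's set corresponds to a specific shard, that reordering alone is enough to trigger data movement.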

Is there something I'm missing that would make this rule more stable in the face of OSD addition? (I'm wondering if the host choosing component should be "firstn" rather than "indep", even though the discussion at https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#crushmaprules implies indep is preferable in EC pools.)

I don't have current plans to expand beyond a three-host cluster, but if there's an alternative way to express "not more than two OSDs per host", that could be helpful as well.

Any insights or suggestions would be appreciated.

Thanks,
aschmitz
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

