Hi all! Hope some of you can shed some light on this.

We have problems balancing data in one of our clusters: even with the balancer module enabled we see over 10% variance between the fullest and emptiest OSD. We believe the problem is one of our erasure-coded pools. The Ceph docs are a bit sparse when it comes to erasure-coded CRUSH maps that use multiple levels of failure domains, but we tested this setup on a test cluster and verified that it worked as expected before putting it into production.

The problem now is that manual PG upmaps are refused:

# ceph osd pg-upmap-items 36.71 260 60
set 36.71 pg_upmap_items mapping to [260->60]

The monitor log reports the following:

2022-06-27T11:48:03.789+0200 7f93600b0700 -1 verify_upmap number of buckets 6 exceeds desired 2
2022-06-27T11:48:03.789+0200 7f93600b0700 0 check_pg_upmaps verify_upmap of pg 36.71 returning -22

This error message makes us wonder if there is a problem with the CRUSH rule; see below for the current rule and tree. The command above references osd.60, which is in host hk-cephnode-56, and osd.260, which is in host hk-cephnode-68. Both hosts are in the same rack, so we believe the PG movement should be valid.

The idea is to have our failure domains across pods (switches), then racks, and lastly hosts, with no more than one shard of a PG per rack when using EC 4+2. That way we can sustain the failure of any one pod, any two racks, any two hosts, or any two OSDs. We believe the num on the last 'chooseleaf_indep' step should be changed, but to what? We also struggle to understand whether that change would allow more than one shard per rack. If so, how can we achieve better PG distribution while keeping our failure-domain constraints? Or is the problem something else?

Any thoughts?

--thomas

---

# ceph osd crush rule dump hk_pod_hdd_ec42_isa_cauchy
{
    "rule_id": 9,
    "rule_name": "hk_pod_hdd_ec42_isa_cauchy",
    "ruleset": 9,
    "type": 3,
    "min_size": 4,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -59,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "pod"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "rack"
        },
        {
            "op": "chooseleaf_indep",
            "num": 1,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

---

# ceph osd erasure-code-profile get hk_ec42_isa_cauchy
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
k=4
m=2
plugin=isa
technique=cauchy

---
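In case it is useful for reproducing or debugging this, the rule can also be exercised offline with crushtool along the lines below. This is just a sketch: the file names are placeholders, and rule id 9 / --num-rep 6 are taken from the dumps above.

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
# crushtool -i crushmap.bin --test --rule 9 --num-rep 6 --show-mappings | head
# crushtool -i crushmap.bin --test --rule 9 --num-rep 6 --show-bad-mappings

As far as we understand, --show-bad-mappings should print nothing as long as CRUSH can always fill all six slots, and the decompiled crushmap.txt gives the rule in the editable text syntax if someone wants to suggest a concrete change to the last step.

---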
# ceph osd tree | grep -v osd
  ID  CLASS  WEIGHT      TYPE NAME                         ...
  -1         3259.98901  root default
  -2         3259.98901      datacenter hk-datacenter-01
-129         1086.66309          pod hk-sw-21
 -12          543.33154              rack hk-rack-02
 -44          181.11050                  host hk-cephnode-51
 -70          181.11050                  host hk-cephnode-57
-141          181.11050                  host hk-cephnode-63
 -15          543.33154              rack hk-rack-05
-115          181.11050                  host hk-cephnode-54
-106          181.11050                  host hk-cephnode-60
-157          181.11050                  host hk-cephnode-66
-130         1086.66309          pod hk-sw-22
 -13          543.33154              rack hk-rack-03
 -96          181.11050                  host hk-cephnode-52
 -75          181.11050                  host hk-cephnode-58
-145          181.11050                  host hk-cephnode-64
 -16          543.33154              rack hk-rack-06
-116          181.11050                  host hk-cephnode-55
-110          181.11050                  host hk-cephnode-61
-153          181.11050                  host hk-cephnode-67
-131         1086.66309          pod hk-sw-23
 -14          543.33154              rack hk-rack-04
-100          181.11050                  host hk-cephnode-53
-102          181.11050                  host hk-cephnode-59
-149          181.11050                  host hk-cephnode-65
 -17          543.33154              rack hk-rack-07
-117          181.11050                  host hk-cephnode-56
-125          181.11050                  host hk-cephnode-62
-161          181.11050                  host hk-cephnode-68

---
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx