Hi all! Hope some of you can shed some light on this.

We have problems balancing data in one of our clusters: even with the balancer module enabled we see over 10% variance between the fullest and emptiest OSD. We believe the problem is one of our erasure-coded pools. The Ceph docs are a bit sparse when it comes to erasure-coded CRUSH maps that use multiple levels of failure domains, but we tested this setup on a test cluster and verified that it worked as expected before putting it into production.

The problem now is that manual PG upmaps are refused:

# ceph osd pg-upmap-items 36.71 260 60
set 36.71 pg_upmap_items mapping to [260->60]

The monitor log reports the following:

2022-06-27T11:48:03.789+0200 7f93600b0700 -1 verify_upmap number of buckets 6 exceeds desired 2
2022-06-27T11:48:03.789+0200 7f93600b0700 0 check_pg_upmaps verify_upmap of pg 36.71 returning -22

This error message makes us wonder if there is a problem with the CRUSH rule; see below for the current rule and tree. The command above references osd.60, which is in host hk-cephnode-56, and osd.260, which is in host hk-cephnode-68. Both hosts are in the same rack, so we believe the PG movement should be valid.

The idea is to have our failure domains across pods (switches), then racks, and lastly hosts, with no more than one shard of a PG per rack when using EC 4+2. That way we can sustain the failure of any one pod, any two racks, any two hosts, or any two OSDs. We believe the num on the last 'chooseleaf_indep' step should be changed, but to what? We also struggle to understand whether that change would allow more than one shard per rack. If so, how can we achieve better PG distribution while keeping our failure-domain constraints? Or is the problem something else?

Any thoughts?

--thomas

---

# ceph osd crush rule dump hk_pod_hdd_ec42_isa_cauchy
{
    "rule_id": 9,
    "rule_name": "hk_pod_hdd_ec42_isa_cauchy",
    "ruleset": 9,
    "type": 3,
    "min_size": 4,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -59,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "pod"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "rack"
        },
        {
            "op": "chooseleaf_indep",
            "num": 1,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

---

# ceph osd erasure-code-profile get hk_ec42_isa_cauchy
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
k=4
m=2
plugin=isa
technique=cauchy

---
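In case it is useful for reproducing or debugging this, the rule can also be exercised offline with crushtool along the lines below. This is just a sketch: the file names are placeholders, and rule id 9 / --num-rep 6 are taken from the dumps above.

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt
# crushtool -i crushmap.bin --test --rule 9 --num-rep 6 --show-mappings | head
# crushtool -i crushmap.bin --test --rule 9 --num-rep 6 --show-bad-mappings

As far as we understand, --show-bad-mappings should print nothing as long as CRUSH can always fill all six slots, and the decompiled crushmap.txt gives the rule in the editable text syntax if someone wants to suggest a concrete change to the last step.

---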
# ceph osd tree | grep -v osd
  ID  CLASS  WEIGHT      TYPE NAME                         ...
  -1         3259.98901  root default
  -2         3259.98901      datacenter hk-datacenter-01
-129         1086.66309          pod hk-sw-21
 -12          543.33154              rack hk-rack-02
 -44          181.11050                  host hk-cephnode-51
 -70          181.11050                  host hk-cephnode-57
-141          181.11050                  host hk-cephnode-63
 -15          543.33154              rack hk-rack-05
-115          181.11050                  host hk-cephnode-54
-106          181.11050                  host hk-cephnode-60
-157          181.11050                  host hk-cephnode-66
-130         1086.66309          pod hk-sw-22
 -13          543.33154              rack hk-rack-03
 -96          181.11050                  host hk-cephnode-52
 -75          181.11050                  host hk-cephnode-58
-145          181.11050                  host hk-cephnode-64
 -16          543.33154              rack hk-rack-06
-116          181.11050                  host hk-cephnode-55
-110          181.11050                  host hk-cephnode-61
-153          181.11050                  host hk-cephnode-67
-131         1086.66309          pod hk-sw-23
 -14          543.33154              rack hk-rack-04
-100          181.11050                  host hk-cephnode-53
-102          181.11050                  host hk-cephnode-59
-149          181.11050                  host hk-cephnode-65
 -17          543.33154              rack hk-rack-07
-117          181.11050                  host hk-cephnode-56
-125          181.11050                  host hk-cephnode-62
-161          181.11050                  host hk-cephnode-68

---
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx