EC pool degrades when adding device-class to crush rule

Hi all!

The cluster was installed before device classes were a thing, so in
preparation for adding some SSDs to a Ceph cluster with OSDs on 7 machines,
I migrated all replicated pools to a CRUSH rule with the device class set.
That caused lots of misplaced objects (probably because of the changed IDs in
the CRUSH tree), but the cluster stayed HEALTH_OK the whole time.
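
For the replicated pools that boiled down to creating a class-aware replicated
rule and pointing each pool at it, roughly along these lines (rule and pool
names are only examples):

# replicated rule limited to the hdd class, root "default", failure domain "host"
ceph osd crush rule create-replicated replicated_hdd default host hdd
# point a pool at the new rule
ceph osd pool set <poolname> crush_rule replicated_hdd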

I'm now testing the migration of a 4+2 erasure-coded pool to a rule that also
includes the device class.  Here are the old and new CRUSH rules:

# Old
rule default.rgw.buckets.data {
        id 3
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}

# New
rule default.rgw.buckets.data_hdd {
        id 4
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step chooseleaf indep 0 type host
        step emit
}
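
For reference, a class-restricted erasure rule like the new one can be
generated from an erasure-code profile instead of being edited into the CRUSH
map by hand (the profile name here is only illustrative):

# EC profile matching the pool's 4+2 layout, limited to the hdd class
ceph osd erasure-code-profile set rgw_data_hdd k=4 m=2 crush-failure-domain=host crush-device-class=hdd
# derive the class-aware rule from that profile
ceph osd crush rule create-erasure default.rgw.buckets.data_hdd rgw_data_hdd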

After changing the rule with

ceph osd pool set default.rgw.buckets.data crush_rule default.rgw.buckets.data_hdd

I'm getting quite a few degraded placement groups, and I'm not sure whether I
still have a redundant setup during the migration:

[root@master1 tmp]# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

[root@master1 tmp]# ceph -s
  cluster:
    id:     3b46f93c-788a-11e9-bc8c-bcaec503b525
    health: HEALTH_WARN
            Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded
 
  services:
    mon: 5 daemons, quorum master1.dev,master2.dev,master3.dev,master4.dev,master5.dev (age 10m)
    mgr: master3.dev(active, since 8m), standbys: master5.dev
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 8m), 14 in (since 9m)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   21 pools, 586 pgs
    objects: 2.70k objects, 365 MiB
    usage:   281 GiB used, 279 GiB / 560 GiB avail
    pgs:     4903/15474 objects degraded (31.685%)
             7054/15474 objects misplaced (45.586%)
             554 active+clean
             20  active+recovering
             12  active+recovery_wait+degraded

  io:
    recovery: 545 KiB/s, 18 objects/s

[root@master1 tmp]# ceph health detail
HEALTH_WARN Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded
    pg 19.0 is active+recovery_wait+degraded, acting [9,2,3,13,11,5]
    pg 19.2 is active+recovery_wait+degraded, acting [2,6,0,11,7,12]
    pg 19.5 is active+recovery_wait+degraded, acting [5,1,0,11,4,12]
    pg 19.6 is active+recovery_wait+degraded, acting [1,6,3,8,7,11]
    pg 19.7 is active+recovery_wait+degraded, acting [13,6,10,3,7,8]
    pg 19.b is active+recovery_wait+degraded, acting [4,1,13,6,11,9]
    pg 19.c is active+recovery_wait+degraded, acting [8,6,11,7,13,2]
    pg 19.e is active+recovery_wait+degraded, acting [6,8,0,11,12,4]
    pg 19.f is active+recovery_wait+degraded, acting [5,10,13,1,0,4]
    pg 19.10 is active+recovery_wait+degraded, acting [3,11,5,9,7,1]
    pg 19.1a is active+recovery_wait+degraded, acting [7,0,8,1,11,5]
    pg 19.1f is active+recovery_wait+degraded, acting [7,5,8,3,1,11]

[root@master1 tmp]# ceph osd pool list detail | grep default.rgw.buckets.data
pool 19 'default.rgw.buckets.data' erasure profile default.rgw.buckets.data size 6 min_size 5 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 174 flags hashpspool stripe_width 16384 pg_num_min 32 target_size_ratio 69.8 application rgw

The output of ceph health detail looks to me as if all six chunks of the
erasure-coded data are still there, as expected.

But why are these placement groups shown as degraded?  In this scenario I'd
expect only misplaced object chunks, the same as with the replicated pools.

Is there actually any reduction in redundancy or not?
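
For what it's worth, the up and acting sets of a single degraded PG can be
compared like this (19.0 being one of the PGs from the health detail above):

# prints the up set (target placement under the new rule) and the acting set
# (OSDs currently serving the shards); with 4+2 EC I'd expect six acting
# entries and no 2147483647 (NONE) placeholder if all chunks are still served
ceph pg map 19.0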

The cluster recovers after some time.  I've also tested with more (256)
placement groups but the outcome is the same.
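
The 256 PG test was just the same migration with pg_num raised beforehand,
something along the lines of:

ceph osd pool set default.rgw.buckets.data pg_num 256
ceph osd pool set default.rgw.buckets.data pgp_num 256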

Thanks,
LF.
-- 
Lars Fenneberg, lf@xxxxxxxxxxxxx