Re: EC pool degrades when adding device-class to crush rule

Hi,

I tried to find an older thread that explained this quite well, but maybe my google-fu has left me...
Anyway, the docs [1] explain the "degraded" state of a PG:

When a client writes an object to the primary OSD, the primary OSD is responsible for writing the replicas to the replica OSDs. After the primary OSD writes the object to storage, the PG will remain in a degraded state until the primary OSD has received an acknowledgement from the replica OSDs that Ceph created the replica objects successfully.

Applying that to your situation, where PGs are moved across nodes (well, they're not really moved but recreated), it can take quite some time until they become fully available, depending on the PG size and the number of objects in them. So as long as you don't have inactive PGs you're fine being in a degraded state, provided it resolves eventually. Having degraded PGs is nothing unusual, e.g. during maintenance when a server is rebooted.
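
If you want to keep an eye on it while the data is moving, the standard CLI is enough, something along these lines:

watch -n 10 'ceph -s'        # degraded/misplaced counters should keep shrinking
ceph pg ls degraded          # every PG listed here should still be "active+..."
ceph pg dump_stuck inactive  # the actually worrying state; ideally this is empty

As long as nothing shows up as inactive and the degraded counter keeps going down, recovery is just doing its job.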

I'm not sure if this is the same explanation as in the thread I was hoping to find, but I hope it answers the question. Please correct me if I'm wrong, though.

Regards,
Eugen

[1] https://docs.ceph.com/en/latest/rados/operations/monitoring-osd-pg/#degraded

Quoting Lars Fenneberg <lf@xxxxxxxxxxxxx>:

Hi all!

The cluster was installed before device classes were a thing, so in
preparation for installing some SSDs into a Ceph cluster with OSDs on 7 machines
I migrated all replicated pools to a CRUSH rule with the device class set.  Lots
of misplaced objects (probably because of changed IDs in the CRUSH tree), but
the cluster was HEALTH_OK the whole time.

I'm now testing the migration of a 4+2 erasure coded pool to also include
the device-class. Here are the old and new CRUSH rules:

# Old
rule default.rgw.buckets.data {
        id 3
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}

# New
rule default.rgw.buckets.data_hdd {
        id 4
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step chooseleaf indep 0 type host
        step emit
}
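
For reference, a device-class aware EC rule like this can be created from an erasure-code profile that has crush-device-class set; the profile name "rgw-data-hdd" and its parameters below are just an illustration matching the 4+2 / host failure-domain layout:

# create a profile that pins the device class
ceph osd erasure-code-profile set rgw-data-hdd \
        k=4 m=2 crush-failure-domain=host crush-device-class=hdd
# generate the CRUSH rule from that profile
ceph osd crush rule create-erasure default.rgw.buckets.data_hdd rgw-data-hdd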

After changing the rule with

ceph osd pool set default.rgw.buckets.data crush_rule default.rgw.buckets.data_hdd

I'm getting quite a few degraded placement groups, and I'm not sure whether I still
have a fully redundant setup during the migration:

[root@master1 tmp]# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

[root@master1 tmp]# ceph -s
  cluster:
    id:     3b46f93c-788a-11e9-bc8c-bcaec503b525
    health: HEALTH_WARN
            Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded

  services:
    mon: 5 daemons, quorum master1.dev,master2.dev,master3.dev,master4.dev,master5.dev (age 10m)
    mgr: master3.dev(active, since 8m), standbys: master5.dev
    mds: 1/1 daemons up, 1 standby
    osd: 14 osds: 14 up (since 8m), 14 in (since 9m)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   21 pools, 586 pgs
    objects: 2.70k objects, 365 MiB
    usage:   281 GiB used, 279 GiB / 560 GiB avail
    pgs:     4903/15474 objects degraded (31.685%)
             7054/15474 objects misplaced (45.586%)
             554 active+clean
             20  active+recovering
             12  active+recovery_wait+degraded

  io:
    recovery: 545 KiB/s, 18 objects/s

[root@master1 tmp]# ceph health detail
HEALTH_WARN Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 4903/15474 objects degraded (31.685%), 12 pgs degraded
    pg 19.0 is active+recovery_wait+degraded, acting [9,2,3,13,11,5]
    pg 19.2 is active+recovery_wait+degraded, acting [2,6,0,11,7,12]
    pg 19.5 is active+recovery_wait+degraded, acting [5,1,0,11,4,12]
    pg 19.6 is active+recovery_wait+degraded, acting [1,6,3,8,7,11]
    pg 19.7 is active+recovery_wait+degraded, acting [13,6,10,3,7,8]
    pg 19.b is active+recovery_wait+degraded, acting [4,1,13,6,11,9]
    pg 19.c is active+recovery_wait+degraded, acting [8,6,11,7,13,2]
    pg 19.e is active+recovery_wait+degraded, acting [6,8,0,11,12,4]
    pg 19.f is active+recovery_wait+degraded, acting [5,10,13,1,0,4]
    pg 19.10 is active+recovery_wait+degraded, acting [3,11,5,9,7,1]
    pg 19.1a is active+recovery_wait+degraded, acting [7,0,8,1,11,5]
    pg 19.1f is active+recovery_wait+degraded, acting [7,5,8,3,1,11]

[root@master1 tmp]# ceph osd pool list detail | grep default.rgw.buckets.data
pool 19 'default.rgw.buckets.data' erasure profile default.rgw.buckets.data size 6 min_size 5 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 174 flags hashpspool stripe_width 16384 pg_num_min 32 target_size_ratio 69.8 application rgw

The output of ceph health detail looks to me as if all the six chunks of the
erasure coded data are still there as expected.
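
For what it's worth, the per-PG view can also be checked with the standard pg query command, which prints the up and acting sets plus a recovery_state section for that PG:

ceph pg 19.0 query | less    # up/acting sets, peer info and recovery_state for this PG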

But why are these placement groups shown as degraded?  I'd expect only
misplaced object chunks in this scenario, the same as with the replicated pools.

Is there actually any reduction in redundancy or not?

The cluster recovers after some time.  I've also tested with more (256)
placement groups but the outcome is the same.

Thanks,
LF.
--
Lars Fenneberg, lf@xxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


