Re: Large number of misplaced PGs but little backfill going on

On 23-03-2024 19:05, Alexander E. Patrakov wrote:
Sorry for replying to myself, but "ceph osd pool ls detail" by itself
is insufficient. For every erasure code profile mentioned in the
output, please also run something like this:

ceph osd erasure-code-profile get prf-for-ec-data

...where "prf-for-ec-data" is the name that appears after the words
"erasure profile" in the "ceph osd pool ls detail" output.

[root@lazy ~]# ceph osd pool ls detail | grep erasure
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096 autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm zstd compression_mode aggressive application rbd

[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8
[root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
crush-device-class=ssd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=5
plugin=jerasure
technique=reed_sol_van
w=8

But as I understand it, those profiles are only used to create the initial CRUSH rule for the pool, and we have manually edited the rules along the way. Here are the 3 rules in use for the 3 EC pools:

rule rbd_ec_data {
        id 0
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 2 type host
        step emit
}
rule cephfs.hdd.data {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}
rule rbd.ssd.data {
        id 8
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class ssd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}

These rules should first pick all 3 datacenters in the choose step and then either 2 or 3 hosts per datacenter in the chooseleaf step, giving 3x2 = 6 and 3x3 = 9 shards, matching EC 4+2 and 4+5 respectively.
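If it is useful, that can be verified against the compiled crushmap with crushtool (a sketch, not run here; rule ids are taken from the dumps above):

    ceph osd getcrushmap -o /tmp/crushmap.bin
    # rule 7 (cephfs.hdd.data) must place all 9 shards of a 4+5 PG;
    # --show-bad-mappings prints only mappings that got fewer OSDs than requested
    crushtool -i /tmp/crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings
    # rule 0 (rbd_ec_data) must place all 6 shards of a 4+2 PG
    crushtool -i /tmp/crushmap.bin --test --rule 0 --num-rep 6 --show-bad-mappings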

Best regards,

Torkil

On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:

Hi Torkil,

I take my previous response back.

You have an erasure-coded pool with nine shards but only three
datacenters. This, in general, cannot work. You need either nine
datacenters or a very custom CRUSH rule. The second option may not be
available if the current EC setup is already incompatible, as there is
no way to change the EC parameters.
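By "very custom" I mean something along the lines of the sketch below (untested, the name and id are just placeholders for illustration): take all three datacenters and then three hosts in each, so a 9-shard PG lands as 3 shards per datacenter:

rule ec_9_shards_3_dc {        # hypothetical name for illustration
        id 99                  # example id, pick a free one
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}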

It would help if you provided the output of "ceph osd pool ls detail".

On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
<patrakov@xxxxxxxxx> wrote:

Hi Torkil,

Unfortunately, your files contain nothing obviously bad or suspicious,
except for two things: more PGs than usual and bad balance.

What's your "mon max pg per osd" setting?

On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil@xxxxxxxx> wrote:

On 2024-03-23 17:54, Kai Stian Olstad wrote:
On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:

The other output is too big for pastebin, and I'm not familiar with paste services. Any suggestion for a preferred way to share such output?

You can attach files to the mail here on the list.

Doh, for some reason I was sure attachments would be stripped. Thanks,
attached.

Best regards,

Torkil



--
Alexander E. Patrakov



--
Alexander E. Patrakov




--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



