Re: 3 DC with 4+5 EC not quite working


 



Hello Torkil, 
  
We're using a similar EC scheme to yours (k=5 and m=4 over 3 DCs) with the rule below: 
  
 
rule ec54 { 
        id 3 
        type erasure 
        min_size 3 
        max_size 9 
        step set_chooseleaf_tries 5 
        step set_choose_tries 100 
        step take default class hdd 
        step choose indep 0 type datacenter 
        step chooseleaf indep 3 type host 
        step emit 
} 
  
That works fine. The only difference I see from your EC rule is that we set min_size and max_size, but I doubt that has anything to do with your situation. 
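
If you want to sanity-check how a rule maps OSDs before putting data on it, you can run it through crushtool. Just a sketch; the file name is arbitrary and the rule id (3 matches our ec54 rule, yours will differ):

ceph osd getcrushmap -o crushmap.bin   # dump the current CRUSH map
crushtool -i crushmap.bin --test --rule 3 --num-rep 9 --show-mappings | head
crushtool -i crushmap.bin --test --rule 3 --num-rep 9 --show-bad-mappings

The last command lists any mapping that comes back with fewer than 9 OSDs, so ideally it prints nothing.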
  
Since the cluster still complains about "Pool cephfs.hdd.data has 1024 placement groups, should have 2048", did you run "ceph osd pool set cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set cephfs.hdd.data pg_num 2048"? [1] 
  
It might be that the pool still effectively has only 1024 PGs. 
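
You can check what the pool currently has with the plain ceph CLI:

ceph osd pool get cephfs.hdd.data pg_num
ceph osd pool get cephfs.hdd.data pgp_num

If pgp_num is still at 1024, raising it as above is what actually moves data into the new PGs.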
    
 
Regards,
Frédéric. 
  
[1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups  

   

-----Original Message-----

From: Torkil <torkil@xxxxxxxx>
To: ceph-users <ceph-users@xxxxxxx>
Cc: Ruben <rkv@xxxxxxxx>
Sent: Friday, 12 January 2024, 09:00 CET
Subject: 3 DC with 4+5 EC not quite working

We are looking to create a 3 datacenter 4+5 erasure coded pool but can't 
quite get it to work. Ceph version 17.2.7. These are the hosts (there 
will eventually be 6 hdd hosts in each datacenter): 

-33   886.00842  datacenter 714 
 -7   209.93135      host ceph-hdd1 
-69    69.86389      host ceph-flash1 
 -6   188.09579      host ceph-hdd2 
 -3   233.57649      host ceph-hdd3 
-12   184.54091      host ceph-hdd4 
-34   824.47168  datacenter DCN 
-73    69.86389      host ceph-flash2 
 -2   201.78067      host ceph-hdd5 
-81   288.26501      host ceph-hdd6 
-31   264.56207      host ceph-hdd7 
-36  1284.48621  datacenter TBA 
-77    69.86389      host ceph-flash3 
-21   190.83224      host ceph-hdd8 
-29   199.08838      host ceph-hdd9 
-11   193.85382      host ceph-hdd10 
 -9   237.28154      host ceph-hdd11 
-26   187.19536      host ceph-hdd12 
 -4   206.37102      host ceph-hdd13 

We did this: 

ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd \
    plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default \
    crush-failure-domain=datacenter crush-device-class=hdd 

ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd 
ceph osd pool set cephfs.hdd.data allow_ec_overwrites true 
ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn 
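
For reference, the resulting profile and pool settings can be inspected with the standard ceph CLI (shown only as a sketch):

ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
ceph osd pool get cephfs.hdd.data erasure_code_profile
ceph osd pool get cephfs.hdd.data crush_rule
ceph osd pool get cephfs.hdd.data min_size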

Didn't quite work: 

" 
[WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg 
incomplete 
pg 33.0 is creating+incomplete, acting 
[104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool 
cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for 
'incomplete') 
" 

I then manually changed the crush rule from this: 

" 
rule cephfs.hdd.data { 
        id 7 
        type erasure 
        step set_chooseleaf_tries 5 
        step set_choose_tries 100 
        step take default class hdd 
        step chooseleaf indep 0 type datacenter 
        step emit 
} 
" 

To this: 

" 
rule cephfs.hdd.data { 
        id 7 
        type erasure 
        step set_chooseleaf_tries 5 
        step set_choose_tries 100 
        step take default class hdd 
        step choose indep 0 type datacenter 
        step chooseleaf indep 3 type host 
        step emit 
} 
" 

This was based on some testing and a dialogue I had with Red Hat support last year 
when we were on RHCS, and it seemed to work. Then: 

ceph fs add_data_pool cephfs cephfs.hdd.data 
ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data 

I started copying data to the subvolume and increased pg_num a couple of 
times: 

ceph osd pool set cephfs.hdd.data pg_num 256 
ceph osd pool set cephfs.hdd.data pg_num 2048 

But at some point it failed to activate new PGs, eventually leading to this: 

" 
[WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs 
mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are 
blocked > 30 secs, oldest blocked for 25455 secs 
[WARN] MDS_TRIM: 1 MDSs behind on trimming 
mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming 
(997/128) max_segments: 128, num_segments: 997 
[WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive 
pg 33.6f6 is stuck inactive for 8h, current state 
activating+remapped, last acting [50,79,116,299,98,219,164,124,421] 
pg 33.6fa is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[17,408,NONE,196,223,290,73,39,11] 
pg 33.705 is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[33,273,71,NONE,411,96,28,7,161] 
pg 33.721 is stuck inactive for 7h, current state 
activating+remapped, last acting [283,150,209,423,103,325,118,142,87] 
pg 33.726 is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[234,NONE,416,121,54,141,277,265,19] 
[WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects 
degraded (0.000%), 3 pgs degraded, 3 pgs undersized 
pg 33.6fa is stuck undersized for 7h, current state 
activating+undersized+degraded+remapped, last acting 
[17,408,NONE,196,223,290,73,39,11] 
pg 33.705 is stuck undersized for 7h, current state 
activating+undersized+degraded+remapped, last acting 
[33,273,71,NONE,411,96,28,7,161] 
pg 33.726 is stuck undersized for 7h, current state 
activating+undersized+degraded+remapped, last acting 
[234,NONE,416,121,54,141,277,265,19] 
[WARN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups 
Pool cephfs.hdd.data has 1024 placement groups, should have 2048 
[WARN] SLOW_OPS: 3925 slow ops, oldest one blocked for 26012 sec, 
daemons [osd.17,osd.234,osd.283,osd.33,osd.50] have slow ops. 
" 

We had thought it would only affect that particular data pool, but eventually 
everything ground to a halt, including RBD on unrelated replicated pools. 
Looking at it now, I guess that was because the 5 OSDs were blocked for 
everything and not just for the PGs of that data pool? 
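
For the record, the ops stuck on a given OSD can be dumped through its admin socket, e.g. for osd.17 (just the usual diagnostics, sketched here):

ceph daemon osd.17 dump_ops_in_flight
ceph daemon osd.17 dump_blocked_ops
ceph pg 33.6fa query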

We tried restarting the 5 blocked OSDs to no avail and eventually 
resorted to deleting the cephfs.hdd.data data pool to restore service. 

Any suggestions as to what we did wrong? Something to do with min_size? 
The crush rule? 

Thanks. 

Mvh. 

Torkil 
-- 
Torkil Svensgaard 
Systems Administrator 
Danish Research Centre for Magnetic Resonance DRCMR, Section 714 
Copenhagen University Hospital Amager and Hvidovre 
Kettegaard Allé 30, 2650 Hvidovre, Denmark 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx  