Regards,
Frédéric.
[1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
-----Original Message-----
From: Torkil <torkil@xxxxxxxx>
To: ceph-users <ceph-users@xxxxxxx>
Cc: Ruben <rkv@xxxxxxxx>
Sent: Friday, 12 January 2024 09:00 CET
Subject: 3 DC with 4+5 EC not quite working
We are looking to create a 3-datacenter 4+5 erasure-coded pool but can't
quite get it to work. The Ceph version is 17.2.7. These are the hosts (there
will eventually be 6 hdd hosts in each datacenter):
-33   886.00842  datacenter 714
 -7   209.93135      host ceph-hdd1
-69    69.86389      host ceph-flash1
 -6   188.09579      host ceph-hdd2
 -3   233.57649      host ceph-hdd3
-12   184.54091      host ceph-hdd4
-34   824.47168  datacenter DCN
-73    69.86389      host ceph-flash2
 -2   201.78067      host ceph-hdd5
-81   288.26501      host ceph-hdd6
-31   264.56207      host ceph-hdd7
-36  1284.48621  datacenter TBA
-77    69.86389      host ceph-flash3
-21   190.83224      host ceph-hdd8
-29   199.08838      host ceph-hdd9
-11   193.85382      host ceph-hdd10
 -9   237.28154      host ceph-hdd11
-26   187.19536      host ceph-hdd12
 -4   206.37102      host ceph-hdd13
We did this:
ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd \
    plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default \
    crush-failure-domain=datacenter crush-device-class=hdd
ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
That didn't quite work:
"
[WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
incomplete
pg 33.0 is creating+incomplete, acting
[104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
'incomplete')
"
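A rough sketch of the arithmetic behind that acting set, assuming the generated rule asks CRUSH for one distinct datacenter per chunk: with k=4 and m=5 there are 9 chunks, but only 3 datacenter buckets exist, so at most 3 slots can be filled. The function name `placeable_chunks` is made up for illustration; the min_size of 5 and the acting set are copied from the health message above.

```python
def placeable_chunks(k: int, m: int, failure_domains: int) -> int:
    """Chunks that can be placed when every chunk needs its own failure domain."""
    return min(k + m, failure_domains)


k, m = 4, 5          # from the erasure-code profile
datacenters = 3      # 714, DCN, TBA
min_size = 5         # from the health warning

placed = placeable_chunks(k, m, datacenters)

# Acting set reported for pg 33.0; None stands for NONE (no OSD mapped).
acting = [104, 219, None, None, None, 41, None, None, None]
mapped = sum(osd is not None for osd in acting)

print(f"chunks needed: {k + m}, placeable: {placed}, mapped in acting set: {mapped}")
print(f"enough to go active (>= min_size)? {placed >= min_size}")
```

Under that assumption the 3 mapped OSDs match the 3 available datacenters, and 3 is below min_size, which is consistent with the PG staying incomplete.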
I then manually changed the crush rule from this:
"
rule cephfs.hdd.data {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step chooseleaf indep 0 type datacenter
        step emit
}
"
To this:
"
rule cephfs.hdd.data {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}
"
This change was based on some testing and a dialogue I had with Red Hat
support last year when we were on RHCS, and it seemed to work. Then:
ceph fs add_data_pool cephfs cephfs.hdd.data
ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data
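To illustrate what the modified rule (choose all datacenters, then 3 hosts in each) is meant to achieve, here is a toy placement model. It is not the CRUSH algorithm: it ignores weights and retries and simply samples hosts at random. The host lists come from the tree above, on the assumption that only the ceph-hdd* hosts carry hdd-class OSDs; `toy_place` and its parameters are hypothetical names.

```python
import random

# hdd hosts per datacenter, taken from the tree above (flash hosts left out,
# assuming they hold no hdd-class OSDs).
HDD_HOSTS = {
    "714": ["ceph-hdd1", "ceph-hdd2", "ceph-hdd3", "ceph-hdd4"],
    "DCN": ["ceph-hdd5", "ceph-hdd6", "ceph-hdd7"],
    "TBA": ["ceph-hdd8", "ceph-hdd9", "ceph-hdd10",
            "ceph-hdd11", "ceph-hdd12", "ceph-hdd13"],
}


def toy_place(hosts_by_dc, per_dc, down=frozenset(), seed=0):
    """Pick per_dc distinct available hosts in every datacenter.

    A slot that cannot be filled is returned as None, mirroring the
    NONE entries of an erasure-coded acting set.
    """
    rng = random.Random(seed)
    placement = []
    for dc in sorted(hosts_by_dc):
        up = [h for h in hosts_by_dc[dc] if h not in down]
        chosen = rng.sample(up, min(per_dc, len(up)))
        placement.extend(chosen + [None] * (per_dc - len(chosen)))
    return placement


healthy = toy_place(HDD_HOSTS, per_dc=3)
one_down = toy_place(HDD_HOSTS, per_dc=3, down={"ceph-hdd6"})

print("slots:", len(healthy), "holes:", healthy.count(None))
print("holes with ceph-hdd6 unavailable:", one_down.count(None))
```

In this model all 9 slots are filled on distinct hosts when everything is up, but DCN has exactly three hdd hosts, so taking any one of them away leaves a slot that cannot be filled there.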
I started copying data to the subvolume and increased pg_num a couple of
times:
ceph osd pool set cephfs.hdd.data pg_num 256
ceph osd pool set cephfs.hdd.data pg_num 2048
But at some point it failed to activate new PGs, eventually leading to this:
"
[WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are
blocked > 30 secs, oldest blocked for 25455 secs
[WARN] MDS_TRIM: 1 MDSs behind on trimming
mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming
(997/128) max_segments: 128, num_segments: 997
[WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
pg 33.6f6 is stuck inactive for 8h, current state
activating+remapped, last acting [50,79,116,299,98,219,164,124,421]
pg 33.6fa is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[17,408,NONE,196,223,290,73,39,11]
pg 33.705 is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[33,273,71,NONE,411,96,28,7,161]
pg 33.721 is stuck inactive for 7h, current state
activating+remapped, last acting [283,150,209,423,103,325,118,142,87]
pg 33.726 is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[234,NONE,416,121,54,141,277,265,19]
[WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects
degraded (0.000%), 3 pgs degraded, 3 pgs undersized
pg 33.6fa is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[17,408,NONE,196,223,290,73,39,11]
pg 33.705 is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[33,273,71,NONE,411,96,28,7,161]
pg 33.726 is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[234,NONE,416,121,54,141,277,265,19]
[WARN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Pool cephfs.hdd.data has 1024 placement groups, should have 2048
[WARN] SLOW_OPS: 3925 slow ops, oldest one blocked for 26012 sec,
daemons [osd.17,osd.234,osd.283,osd.33,osd.50] have slow ops.
"
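To put numbers on the pg_num steps mentioned earlier: every PG of a 4+5 pool has 9 shards, each on a different OSD, so the number of PG instances the hdd OSDs must host grows as pg_num times 9. The sketch below only does that multiplication. The OSD count is a hypothetical placeholder, not a figure from this cluster, and nothing in the output above establishes that PG counts per OSD are what kept these PGs in activating.

```python
def pg_shards(pg_num: int, k: int, m: int) -> int:
    """Total PG shards (PG instances on OSDs) for an erasure-coded pool."""
    return pg_num * (k + m)


K, M = 4, 5
HYPOTHETICAL_HDD_OSDS = 200  # placeholder only; the real OSD count is not given

for pg_num in (256, 1024, 2048):
    total = pg_shards(pg_num, K, M)
    average = total / HYPOTHETICAL_HDD_OSDS
    print(f"pg_num={pg_num}: {total} shards, "
          f"about {average:.1f} per OSD if spread over {HYPOTHETICAL_HDD_OSDS} OSDs")
```

The real per-OSD figures, including PGs from all other pools, are shown in the PGS column of `ceph osd df`.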
We had thought it would only affect that particular data pool, but
eventually everything ground to a halt, including RBD on unrelated
replicated pools. Looking at it now, I guess that was because the 5 OSDs
were blocked for everything and not just for the PGs of that data pool?
We tried restarting the 5 blocked OSDs to no avail and eventually
resorted to deleting the cephfs.hdd.data data pool to restore service.
Any suggestions as to what we did wrong? Something to do with min_size?
The crush rule?
Thanks.
Best regards,
Torkil