Re: 3 DC with 4+5 EC not quite working

On 12-01-2024 10:30, Frank Schilder wrote:
Is it maybe this here: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I always have to tweak the num-tries parameters.

Oh, that seems plausible. Kinda scary, and odd, that hitting this doesn't generate a ceph health warning when the problem is fairly well documented?
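For anyone hitting this later, the procedure from that page boils down to something like the below (a sketch we haven't actually run against our cluster; rule id 7 and --num-rep 9 are taken from our pool, adjust to yours):

ceph osd getcrushmap -o crush.map
# list any PGs the rule cannot map fully with the current tries
crushtool -i crush.map --test --show-bad-mappings --rule 7 --num-rep 9 --min-x 1 --max-x 10000
# decompile, raise "step set_choose_tries 100" in the rule (e.g. to 200), recompile
crushtool -d crush.map -o crush.txt
crushtool -c crush.txt -o crush.new
# re-test and only inject once the bad mappings are gone
crushtool -i crush.new --test --show-bad-mappings --rule 7 --num-rep 9 --min-x 1 --max-x 10000
ceph osd setcrushmap -i crush.new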

Anyhoo, we decided to redo the pool etc., going straight to 2048 PGs before copying any data, and it worked just fine this time. Thanks a lot, both of you.
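For reference, recreating with the target PG count up front looked roughly like this (from memory, the exact invocation may have differed, and the crush rule still needed the same manual edit as before):

ceph osd pool create cephfs.hdd.data 2048 2048 erasure DRCMR_k4m5_datacenter_hdd
ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn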

Best regards,

Torkil

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Torkil Svensgaard <torkil@xxxxxxxx>
Sent: Friday, January 12, 2024 10:17 AM
To: Frédéric Nass
Cc: ceph-users@xxxxxxx; Ruben Vestergaard
Subject:  Re: 3 DC with 4+5 EC not quite working



On 12-01-2024 09:35, Frédéric Nass wrote:

Hello Torkil,

Hi Frédéric

We're using the same EC scheme as yours, with k=5 and m=4 over 3 DCs, with the rule below:


rule ec54 {
          id 3
          type erasure
          min_size 3
          max_size 9
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default class hdd
          step choose indep 0 type datacenter
          step chooseleaf indep 3 type host
          step emit
}

Works fine. The only difference I see with your EC rule is that we set min_size and max_size, but I doubt that has anything to do with your situation.
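If you want to sanity check how a rule spreads the chunks, crushtool can simulate a few placements, something like this (a rough sketch; rule id 3 is our ec54, use your own rule id and chunk count):

ceph osd getcrushmap -o crush.map
# print the OSDs chosen for a handful of sample inputs
crushtool -i crush.map --test --show-mappings --rule 3 --num-rep 9 --min-x 1 --max-x 5
# cross-check against "ceph osd tree" that each mapping has 3 OSDs per datacenter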

Great, thanks. I wonder if we might need to tweak the min_size. I think
I tried lowering it to no avail and then set it back to 5 after editing
the crush rule.

Since the cluster still complains about "Pool cephfs.hdd.data has 1024 placement groups, should have 2048", did you run "ceph osd pool set cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set cephfs.hdd.data pg_num 2048"? [1]

Might be that the pool still has 1024 PGs.
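Something like this would show whether pgp_num lagged behind, and bump it if so (just a sketch):

ceph osd pool get cephfs.hdd.data pg_num
ceph osd pool get cephfs.hdd.data pgp_num
ceph osd pool set cephfs.hdd.data pgp_num 2048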

Hmm, coming from RHCS we didn't do this, because:

"
RHCS 4.x and 5.x does not require the pgp_num value to be set. This will
be done by ceph-mgr automatically. For RHCS 4.x and 5.x, only the pg_num
is required to be incremented for the necessary pools.
"

So I only did "ceph osd pool set cephfs.hdd.data pg_num 2048" and let
the mgr handle the rest. I had a watch running to see how it went, and
the pool was up to something like 1922 PGs when it got stuck.

As I read the documentation [1], this shouldn't have gotten us stuck like
it did, but we would have to set pgp_num eventually to get it to rebalance?
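For the record, this is roughly how I would check where the mgr got to; if I read the output right, pg_num/pgp_num are the current values and pg_num_target/pgp_num_target the goal while a change is in flight:

ceph osd pool ls detail | grep cephfs.hdd.data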

Best regards,

Torkil

[1]
https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#setting-the-number-of-pgs



Regards,
Frédéric.

[1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups



-----Original Message-----

From: Torkil <torkil@xxxxxxxx>
To: ceph-users <ceph-users@xxxxxxx>
Cc: Ruben <rkv@xxxxxxxx>
Sent: Friday, January 12, 2024 09:00 CET
Subject: 3 DC with 4+5 EC not quite working

We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
quite get it to work. Ceph version 17.2.7. These are the hosts (there
will eventually be 6 hdd hosts in each datacenter):

-33   886.00842   datacenter 714
 -7   209.93135       host ceph-hdd1
-69    69.86389       host ceph-flash1
 -6   188.09579       host ceph-hdd2
 -3   233.57649       host ceph-hdd3
-12   184.54091       host ceph-hdd4
-34   824.47168   datacenter DCN
-73    69.86389       host ceph-flash2
 -2   201.78067       host ceph-hdd5
-81   288.26501       host ceph-hdd6
-31   264.56207       host ceph-hdd7
-36  1284.48621   datacenter TBA
-77    69.86389       host ceph-flash3
-21   190.83224       host ceph-hdd8
-29   199.08838       host ceph-hdd9
-11   193.85382       host ceph-hdd10
 -9   237.28154       host ceph-hdd11
-26   187.19536       host ceph-hdd12
 -4   206.37102       host ceph-hdd13

We did this:

ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
crush-failure-domain=datacenter crush-device-class=hdd

ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
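(As an aside, the resulting profile and pool settings can be double-checked with something like the below; listed here only for completeness:)

ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
ceph osd pool get cephfs.hdd.data erasure_code_profile
ceph osd pool get cephfs.hdd.data size
ceph osd pool get cephfs.hdd.data min_size
ceph osd pool get cephfs.hdd.data crush_rule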

Didn't quite work:

"
[WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
incomplete
pg 33.0 is creating+incomplete, acting
[104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
'incomplete')
"

I then manually changed the crush rule from this:

"
rule cephfs.hdd.data {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step chooseleaf indep 0 type datacenter
        step emit
}
"

To this:

"
rule cephfs.hdd.data {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}
"

This was based on some testing and dialogue I had with Red Hat support
last year when we were on RHCS, and it seemed to work. Then:

ceph fs add_data_pool cephfs cephfs.hdd.data
ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data
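(For anyone following along, a subvolume in that group and its mount path can then be had with something like the below; "testvol" is just a placeholder name:)

ceph fs subvolume create cephfs testvol --group_name hdd
ceph fs subvolume getpath cephfs testvol --group_name hdd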

I started copying data to the subvolume and increased pg_num a couple of
times:

ceph osd pool set cephfs.hdd.data pg_num 256
ceph osd pool set cephfs.hdd.data pg_num 2048

But at some point it failed to activate new PGs, eventually leading to this:

"
[WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are
blocked > 30 secs, oldest blocked for 25455 secs
[WARN] MDS_TRIM: 1 MDSs behind on trimming
mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming
(997/128) max_segments: 128, num_segments: 997
[WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
pg 33.6f6 is stuck inactive for 8h, current state
activating+remapped, last acting [50,79,116,299,98,219,164,124,421]
pg 33.6fa is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[17,408,NONE,196,223,290,73,39,11]
pg 33.705 is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[33,273,71,NONE,411,96,28,7,161]
pg 33.721 is stuck inactive for 7h, current state
activating+remapped, last acting [283,150,209,423,103,325,118,142,87]
pg 33.726 is stuck inactive for 11h, current state
activating+undersized+degraded+remapped, last acting
[234,NONE,416,121,54,141,277,265,19]
[WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects
degraded (0.000%), 3 pgs degraded, 3 pgs undersized
pg 33.6fa is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[17,408,NONE,196,223,290,73,39,11]
pg 33.705 is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[33,273,71,NONE,411,96,28,7,161]
pg 33.726 is stuck undersized for 7h, current state
activating+undersized+degraded+remapped, last acting
[234,NONE,416,121,54,141,277,265,19]
[WARN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
Pool cephfs.hdd.data has 1024 placement groups, should have 2048
[WARN] SLOW_OPS: 3925 slow ops, oldest one blocked for 26012 sec,
daemons [osd.17,osd.234,osd.283,osd.33,osd.50] have slow ops.
"

We had thought it would only affect that particular data pool, but
eventually everything ground to a halt, including RBD on unrelated
replicated pools. Looking at it now, I guess that was because the 5 OSDs
were blocked for everything, not just the PGs in that data pool?
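In hindsight, something like this on the blocked OSDs might have shown what the slow ops were stuck on (osd.17 is just one example from the health output; the daemon commands have to run on the host carrying that OSD):

ceph health detail
ceph daemon osd.17 dump_ops_in_flight
ceph daemon osd.17 dump_historic_ops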

We tried restarting the 5 blocked OSDs to no avail and eventually
resorted to deleting the cephfs.hdd.data data pool to restore service.

Any suggestions as to what we did wrong? Something to do with min_size?
The crush rule?

Thanks.

Best regards,

Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



