Is it maybe this here:
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I always have to tweak the num-tries parameters.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Torkil Svensgaard <torkil@xxxxxxxx>
Sent: Friday, January 12, 2024 10:17 AM
To: Frédéric Nass
Cc: ceph-users@xxxxxxx; Ruben Vestergaard
Subject: Re: 3 DC with 4+5 EC not quite working

On 12-01-2024 09:35, Frédéric Nass wrote:
>
> Hello Torkil,

Hi Frédéric

> We're using an EC scheme similar to yours, with k=5 and m=4 over 3 DCs,
> with the rule below:
>
> rule ec54 {
>         id 3
>         type erasure
>         min_size 3
>         max_size 9
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step choose indep 0 type datacenter
>         step chooseleaf indep 3 type host
>         step emit
> }
>
> Works fine. The only difference I see from your EC rule is that we set
> min_size and max_size, but I doubt this has anything to do with your
> situation.

Great, thanks. I wonder if we might need to tweak min_size. I think I
tried lowering it, to no avail, and then set it back to 5 after editing
the crush rule.

> Since the cluster still complains about "Pool cephfs.hdd.data has 1024
> placement groups, should have 2048", did you run "ceph osd pool set
> cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set
> cephfs.hdd.data pg_num 2048"? [1]
>
> Might be that the pool still has 1024 PGs.

Hmm, coming from RHCS we didn't do this, as:

"
RHCS 4.x and 5.x does not require the pgp_num value to be set. This will
be done by ceph-mgr automatically. For RHCS 4.x and 5.x, only the pg_num
is required to be incremented for the necessary pools.
"

So I only did "ceph osd pool set cephfs.hdd.data pg_num 2048" and let
the mgr handle the rest. I had a watch running to see how it went, and
the pool was up to something like 1922 PGs when it got stuck.

As I read the documentation [1], this shouldn't have gotten us stuck
like it did, but we would have to set pgp_num eventually to get it to
rebalance?

Best regards,

Torkil

[1] https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#setting-the-number-of-pgs

> Regards,
> Frédéric.
>
> [1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
>
> -----Original Message-----
> From: Torkil <torkil@xxxxxxxx>
> To: ceph-users <ceph-users@xxxxxxx>
> Cc: Ruben <rkv@xxxxxxxx>
> Sent: Friday, January 12, 2024 09:00 CET
> Subject: 3 DC with 4+5 EC not quite working
>
> We are looking to create a 3 datacenter 4+5 erasure coded pool but
> can't quite get it to work. Ceph version 17.2.7.
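An aside on the pg_num/pgp_num exchange above: a quick way to check
whether pgp_num has followed pg_num, and to bump it by hand if the mgr
has stalled. A minimal sketch, using the pool name from this thread:

ceph osd pool get cephfs.hdd.data pg_num       # target PG count
ceph osd pool get cephfs.hdd.data pgp_num      # placement count actually in use
ceph osd pool set cephfs.hdd.data pgp_num 2048 # make placement follow, if needed

Until pgp_num catches up with pg_num, newly split PGs stay co-located
with their parent PGs and no rebalancing onto other OSDs happens.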
> These are the hosts (there will eventually be 6 hdd hosts in each
> datacenter):
>
> -33    886.00842      datacenter 714
>  -7    209.93135          host ceph-hdd1
> -69     69.86389          host ceph-flash1
>  -6    188.09579          host ceph-hdd2
>  -3    233.57649          host ceph-hdd3
> -12    184.54091          host ceph-hdd4
> -34    824.47168      datacenter DCN
> -73     69.86389          host ceph-flash2
>  -2    201.78067          host ceph-hdd5
> -81    288.26501          host ceph-hdd6
> -31    264.56207          host ceph-hdd7
> -36   1284.48621      datacenter TBA
> -77     69.86389          host ceph-flash3
> -21    190.83224          host ceph-hdd8
> -29    199.08838          host ceph-hdd9
> -11    193.85382          host ceph-hdd10
>  -9    237.28154          host ceph-hdd11
> -26    187.19536          host ceph-hdd12
>  -4    206.37102          host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
> incomplete
>     pg 33.0 is creating+incomplete, acting
> [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
> cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
> 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step chooseleaf indep 0 type datacenter
>         step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step choose indep 0 type datacenter
>         step chooseleaf indep 3 type host
>         step emit
> }
> "
>
> This was based on some testing, and on dialogue I had with Red Hat
> support last year when we were on RHCS, and it seemed to work.
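A rule edit like this can be sanity-checked offline before it is
injected, by running the compiled CRUSH map through crushtool's test
mode. A sketch, assuming rule id 7 from above and num-rep 9 for the
k+m=9 shards:

ceph osd getcrushmap -o crushmap.bin
# print sample PG-to-OSD mappings produced by the rule:
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings
# print only mappings that came up short (missing OSDs):
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings

If --show-bad-mappings prints nothing, the rule can place all 9 shards;
if it prints holes, CRUSH is likely giving up too soon, which is where
the num-tries parameters come in (see the sketch at the end of the
thread).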
> Then:
>
> ceph fs add_data_pool cephfs cephfs.hdd.data
> ceph fs subvolumegroup create cephfs hdd --pool_layout cephfs.hdd.data
>
> I started copying data to the subvolume and increased pg_num a couple
> of times:
>
> ceph osd pool set cephfs.hdd.data pg_num 256
> ceph osd pool set cephfs.hdd.data pg_num 2048
>
> But at some point it failed to activate new PGs, eventually leading to
> this:
>
> "
> [WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
>     mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are
> blocked > 30 secs, oldest blocked for 25455 secs
> [WARN] MDS_TRIM: 1 MDSs behind on trimming
>     mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming
> (997/128) max_segments: 128, num_segments: 997
> [WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
>     pg 33.6f6 is stuck inactive for 8h, current state
> activating+remapped, last acting [50,79,116,299,98,219,164,124,421]
>     pg 33.6fa is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [33,273,71,NONE,411,96,28,7,161]
>     pg 33.721 is stuck inactive for 7h, current state
> activating+remapped, last acting [283,150,209,423,103,325,118,142,87]
>     pg 33.726 is stuck inactive for 11h, current state
> activating+undersized+degraded+remapped, last acting
> [234,NONE,416,121,54,141,277,265,19]
> [WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects
> degraded (0.000%), 3 pgs degraded, 3 pgs undersized
>     pg 33.6fa is stuck undersized for 7h, current state
> activating+undersized+degraded+remapped, last acting
> [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck undersized for 7h, current state
> activating+undersized+degraded+remapped, last acting
> [33,273,71,NONE,411,96,28,7,161]
>     pg 33.726 is stuck undersized for 7h, current state
> activating+undersized+degraded+remapped, last acting
> [234,NONE,416,121,54,141,277,265,19]
> [WARN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
>     Pool cephfs.hdd.data has 1024 placement groups, should have 2048
> [WARN] SLOW_OPS: 3925 slow ops, oldest one blocked for 26012 sec,
> daemons [osd.17,osd.234,osd.283,osd.33,osd.50] have slow ops.
> "
>
> We had thought it would only affect that particular data pool, but
> eventually everything ground to a halt, including RBD on unrelated
> replicated pools. Looking at it now, I guess that was because the 5
> OSDs were blocked for everything, and not just for the PGs of that
> data pool?
>
> We tried restarting the 5 blocked OSDs to no avail and eventually
> resorted to deleting the cephfs.hdd.data data pool to restore service.
>
> Any suggestions as to what we did wrong? Something to do with
> min_size? The crush rule?
>
> Thanks.
>
> Best regards,
>
> Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance
DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
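For the record, the num-tries tweak Frank points to at the top of the
thread (the "CRUSH gives up too soon" case) is applied by decompiling
the CRUSH map, raising the tries values in the rule, and injecting the
map back. A minimal sketch, with placeholder filenames and an example
value:

ceph osd getcrushmap -o crushmap.bin        # dump the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
# edit crushmap.txt: inside the rule, raise the tries, e.g.
#   step set_choose_tries 100   ->   step set_choose_tries 200
#   (set_chooseleaf_tries can be raised the same way)
crushtool -c crushmap.txt -o crushmap.new   # recompile
ceph osd setcrushmap -i crushmap.new        # inject the updated map

The 200 above is only an example; the troubleshooting page linked at
the top describes how to verify a sufficient value with crushtool
--test before injecting the map.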