I could be wrong, but as far as I can see you have 9 chunks, which requires 9 failure domains. Your failure domain is set to datacenter, and you only have 3 of those, so that won't work. You need to set your failure domain to host and then create a crush rule that chooses 3 datacenters and then 3 hosts within each datacenter. Something like this should work:

step choose indep 3 type datacenter
step chooseleaf indep 3 type host

On Fri, 12 Jan 2024 at 20:58, Torkil Svensgaard <torkil@xxxxxxxx> wrote:
> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33    886.00842    datacenter 714
>  -7    209.93135        host ceph-hdd1
> -69     69.86389        host ceph-flash1
>  -6    188.09579        host ceph-hdd2
>  -3    233.57649        host ceph-hdd3
> -12    184.54091        host ceph-hdd4
> -34    824.47168    datacenter DCN
> -73     69.86389        host ceph-flash2
>  -2    201.78067        host ceph-hdd5
> -81    288.26501        host ceph-hdd6
> -31    264.56207        host ceph-hdd7
> -36   1284.48621    datacenter TBA
> -77     69.86389        host ceph-flash3
> -21    190.83224        host ceph-hdd8
> -29    199.08838        host ceph-hdd9
> -11    193.85382        host ceph-hdd10
>  -9    237.28154        host ceph-hdd11
> -26    187.19536        host ceph-hdd12
>  -4    206.37102        host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg incomplete
>     pg 33.0 is creating+incomplete, acting
>     [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
>     cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
>     'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>     id 7
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step chooseleaf indep 0 type datacenter
>     step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>     id 7
>     type erasure
>     step set_chooseleaf_tries 5
>     step set_choose_tries 100
>     step take default class hdd
>     step choose indep 0 type datacenter
>     step chooseleaf indep 3 type host
>     step emit
> }
> "
>
> This was based on some testing and dialogue I had with Red Hat support
> last year when we were on RHCS, and it seemed to work.
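One way to sanity-check a hand-edited rule like the quoted one, or the 3 x 3 variant suggested at the top, before trusting it is to decompile the crush map, edit it, recompile it and run it through crushtool --test. A rough sketch only; the file names are placeholders, and rule id 7 and the 9 chunks are taken from the quoted rule and the 4+5 profile:

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit the cephfs.hdd.data rule in crush.txt, e.g.
#   step choose indep 3 type datacenter
#   step chooseleaf indep 3 type host
crushtool -c crush.txt -o crush.new
crushtool -i crush.new --test --rule 7 --num-rep 9 --show-mappings
ceph osd setcrushmap -i crush.new

Adding --show-bad-mappings to the test prints only the placements that could not get all 9 OSDs, so an empty result before running setcrushmap is a good sign.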
> Then:
>
> ceph fs add_data_pool cephfs cephfs.hdd.data
> ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data
>
> I started copying data to the subvolume and increased pg_num a couple of
> times:
>
> ceph osd pool set cephfs.hdd.data pg_num 256
> ceph osd pool set cephfs.hdd.data pg_num 2048
>
> But at some point it failed to activate new PGs, eventually leading to this:
>
> "
> [WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
>     mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are
>     blocked > 30 secs, oldest blocked for 25455 secs
> [WARN] MDS_TRIM: 1 MDSs behind on trimming
>     mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming (997/128)
>     max_segments: 128, num_segments: 997
> [WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive
>     pg 33.6f6 is stuck inactive for 8h, current state activating+remapped,
>     last acting [50,79,116,299,98,219,164,124,421]
>     pg 33.6fa is stuck inactive for 11h, current state
>     activating+undersized+degraded+remapped, last acting
>     [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck inactive for 11h, current state
>     activating+undersized+degraded+remapped, last acting
>     [33,273,71,NONE,411,96,28,7,161]
>     pg 33.721 is stuck inactive for 7h, current state activating+remapped,
>     last acting [283,150,209,423,103,325,118,142,87]
>     pg 33.726 is stuck inactive for 11h, current state
>     activating+undersized+degraded+remapped, last acting
>     [234,NONE,416,121,54,141,277,265,19]
> [WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects
>     degraded (0.000%), 3 pgs degraded, 3 pgs undersized
>     pg 33.6fa is stuck undersized for 7h, current state
>     activating+undersized+degraded+remapped, last acting
>     [17,408,NONE,196,223,290,73,39,11]
>     pg 33.705 is stuck undersized for 7h, current state
>     activating+undersized+degraded+remapped, last acting
>     [33,273,71,NONE,411,96,28,7,161]
>     pg 33.726 is stuck undersized for 7h, current state
>     activating+undersized+degraded+remapped, last acting
>     [234,NONE,416,121,54,141,277,265,19]
> [WARN] POOL_TOO_FEW_PGS: 1 pools have too few placement groups
>     Pool cephfs.hdd.data has 1024 placement groups, should have 2048
> [WARN] SLOW_OPS: 3925 slow ops, oldest one blocked for 26012 sec, daemons
>     [osd.17,osd.234,osd.283,osd.33,osd.50] have slow ops.
> "
>
> We had thought it would only affect that particular data pool, but
> eventually everything ground to a halt, also RBD on unrelated replicated
> pools. Looking at it now I guess that was because the 5 OSDs were blocked
> for everything and not just the PGs for that data pool?
>
> We tried restarting the 5 blocked OSDs to no avail and eventually resorted
> to deleting the cephfs.hdd.data data pool to restore service.
>
> Any suggestions as to what we did wrong? Something to do with min_size?
> The crush rule?
>
> Thanks.
>
> Mvh.
>
> Torkil
> --
> Torkil Svensgaard
> Systems Administrator
> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
> Copenhagen University Hospital Amager and Hvidovre
> Kettegaard Allé 30, 2650 Hvidovre, Denmark
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
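On the min_size question: with a 4+5 profile, k is 4, and an erasure-coded pool's min_size defaults to k+1 = 5, which matches the "min_size from 5" hint in the first warning. min_size only controls how many of the 9 shards must be available before a PG serves I/O; the NONE entries in the acting sets above are shard positions CRUSH never filled at all, which suggests the placement (the crush rule and failure domains), not min_size, is the thing to fix, in line with the note at the top. A minimal sketch of how one might inspect the relevant settings, reusing the pool and profile names from the quoted commands:

ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
ceph osd pool get cephfs.hdd.data size
ceph osd pool get cephfs.hdd.data min_size
ceph osd pool get cephfs.hdd.data crush_rule

Dropping min_size to k=4 would let PGs go active with no redundancy margin, so it is a last resort rather than a fix.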