Re: CRUSH rule for EC 6+2 on 6-node cluster

Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> · Thu, 27 May 2021 17:16:36 +0200

Hello Dan, Nathan, thanks for your replies, and apologies for my silence.

  Sorry, I made a typo: the pool is really 6+4, not 8+4. And to reply to 
Nathan's message, the rule was built like this in anticipation of 
getting additional servers, at which point I will relax the "2 
chunks per host" part.

[cephmgr@cephAdmPA1.cephAdmPA1 ~]$ ceph osd pool get default.rgw.buckets.data erasure_code_profile
erasure_code_profile: ec_6and4_big
[cephmgr@cephAdmPA1.cephAdmPA1 ~]$ ceph osd erasure-code-profile get ec_6and4_big
crush-device-class=big
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8
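
As a quick sanity check on the numbers above, here is a minimal sketch of the arithmetic, assuming the rule really places exactly 2 chunks on each chosen host (the function name is only illustrative, not part of any Ceph tool):

```python
def host_failures_tolerated(m: int, chunks_per_host: int) -> int:
    """Number of whole hosts that can be lost while at most m chunks are lost."""
    return m // chunks_per_host


k, m = 6, 4                # from the ec_6and4_big profile
chunks_per_host = 2        # assumption: the rule places exactly 2 chunks per host

hosts_needed = (k + m) // chunks_per_host
print(hosts_needed)                                 # 5
print(host_failures_tolerated(m, chunks_per_host))  # 2
```

So a 6+4 pool spread as 2 chunks on each of 5 hosts should keep its data recoverable with up to 2 hosts down, provided no host holds more than 2 chunks of the same PG.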

Indeed, Dan:

[cephmgr@cephAdmPA1.cephAdmPA1 ~]$ ceph osd dump | grep upmap | grep 116.453
pg_upmap_items 116.453 [76,49,129,108]

I don't think I ever set such an upmap myself. Do you think it would be 
good to remove all upmaps, let the upmap balancer do its magic, 
and check again?
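
For reference, a small sketch of how I read that line, assuming the list after the PG id is a flat sequence of (from OSD, to OSD) pairs; the helper name parse_pg_upmap_items is made up for this example:

```python
import re


def parse_pg_upmap_items(line: str):
    """Return (pgid, [(from_osd, to_osd), ...]) for one pg_upmap_items line."""
    match = re.match(r"pg_upmap_items\s+(\S+)\s+\[([\d,]+)\]", line.strip())
    if match is None:
        raise ValueError(f"not a pg_upmap_items line: {line!r}")
    pgid = match.group(1)
    osds = [int(x) for x in match.group(2).split(",")]
    if len(osds) % 2:
        raise ValueError("expected an even number of OSD ids")
    # Consecutive ids form (from, to) pairs.
    return pgid, list(zip(osds[0::2], osds[1::2]))


pgid, pairs = parse_pg_upmap_items("pg_upmap_items 116.453 [76,49,129,108]")
for src, dst in pairs:
    print(f"PG {pgid}: chunk remapped from osd.{src} to osd.{dst}")
```

Under that reading, two chunks of 116.453 were moved (76 to 49, and 129 to 108). If the exception for a single PG has to go, `ceph osd rm-pg-upmap-items 116.453` should remove it.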

  Thanks!

			Fulvio

On 20/05/2021 18:59, Dan van der Ster wrote:
Hold on: 8+4 needs 12 osds but you only show 10 there. Shouldn't you 
choose 6 type host and then chooseleaf 2 type osd?

.. Dan
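
The point about 12 versus 10 OSDs can be checked with a few lines. This is only a sketch of the counting, under the assumption that a "choose N type host" step followed by a "choose(leaf) M type osd" step yields N*M placement slots; the function names are illustrative:

```python
def rule_slots(hosts: int, osds_per_host: int) -> int:
    """Slots produced by choosing `hosts` hosts, then `osds_per_host` OSDs on each."""
    return hosts * osds_per_host


def rule_fits(k: int, m: int, hosts: int, osds_per_host: int) -> bool:
    """True when the rule yields exactly k+m slots."""
    return rule_slots(hosts, osds_per_host) == k + m


print(rule_fits(8, 4, 5, 2))  # False: 8+4 needs 12 slots, the rule gives 10
print(rule_fits(8, 4, 6, 2))  # True:  6 hosts x 2 OSDs = 12
print(rule_fits(6, 4, 5, 2))  # True:  6+4 needs 10 slots, 5 hosts x 2 OSDs = 10
```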

On Thu, May 20, 2021, 1:30 PM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:

    Hello Dan, Bryan,
          I have a rule similar to yours, for an 8+4 pool; the only
    difference is that I replaced the second "choose" with "chooseleaf",
    which I understand should make no difference:

    rule default.rgw.buckets.data {
              id 6
              type erasure
              min_size 3
              max_size 10
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default class big
              step choose indep 5 type host
              step chooseleaf indep 2 type osd
              step emit
    }

        I am on Nautilus 14.2.16, and while performing maintenance the
    other day I noticed that 2 PGs were incomplete, which caused trouble
    for some users. I then verified the following (thanks, Bryan, for the
    command):

    [cephmgr@cephAdmCT1.cephAdmCT1 clusterCT]$ for osd in $(ceph pg map 116.453 -f json | jq -r '.up[]'); do ceph osd find $osd | jq -r '.host'; done | sort | uniq -c | sort -n -k1
            2 r2srv07.ct1.box.garr
            2 r2srv10.ct1.box.garr
            2 r3srv07.ct1.box.garr
            4 r1srv02.ct1.box.garr

        You can see that 4 chunks of this PG were placed on r1srv02.
    Maybe this happened due to some temporary unavailability of the host
    at some point? As all my servers are now up and running, is there a way
    to force the placement rule to rerun?
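
The same per-host count can be sketched in Python, which makes it easy to flag hosts over the limit. The OSD ids and the osd_to_host mapping below are hypothetical placeholders (the real ones would come from `ceph pg map` and `ceph osd find`); only the host names and counts match the output above:

```python
from collections import Counter


def chunks_per_host(up, osd_to_host):
    """Count how many OSDs of the up set live on each host."""
    return Counter(osd_to_host[osd] for osd in up)


def overloaded_hosts(counts, limit=2):
    """Hosts holding more than `limit` chunks of the PG."""
    return sorted(host for host, n in counts.items() if n > limit)


# Hypothetical OSD ids, chosen only to reproduce the counts shown above.
osd_to_host = {
    0: "r2srv07.ct1.box.garr", 1: "r2srv07.ct1.box.garr",
    2: "r2srv10.ct1.box.garr", 3: "r2srv10.ct1.box.garr",
    4: "r3srv07.ct1.box.garr", 5: "r3srv07.ct1.box.garr",
    6: "r1srv02.ct1.box.garr", 7: "r1srv02.ct1.box.garr",
    8: "r1srv02.ct1.box.garr", 9: "r1srv02.ct1.box.garr",
}
up = list(range(10))

counts = chunks_per_host(up, osd_to_host)
print(overloaded_hosts(counts))  # ['r1srv02.ct1.box.garr']
```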

        Thanks!

                             Fulvio

    On 5/16/2021 11:40 PM, Dan van der Ster wrote:
     > Hi Bryan,
     >
     > I had to do something similar, and never found a rule to place "up to"
     > 2 chunks per host, so I stayed with the placement of *exactly* 2
     > chunks per host.
     >
     > But I did this slightly differently to what you wrote earlier: my rule
     > chooses exactly 4 hosts, then chooses exactly 2 osds on each:
     >
     >          type erasure
     >          min_size 3
     >          max_size 10
     >          step set_chooseleaf_tries 5
     >          step set_choose_tries 100
     >          step take default class hdd
     >          step choose indep 4 type host
     >          step choose indep 2 type osd
     >          step emit
     >
     > If you really need the "up to 2" approach then maybe you can split
     > each host into two "host" crush buckets, with half the OSDs in each.
     > Then a normal host-wise rule should work.
     >
     > Cheers, Dan
     >
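
The "split each host in two" idea can be sketched as follows. This only illustrates the grouping: the bucket names with "-a"/"-b" suffixes and the OSD ids are hypothetical, and actually creating and populating the CRUSH buckets is left to the usual tools:

```python
def split_host(host: str, osds):
    """Split one host's OSDs into two halves, one per virtual 'host' bucket."""
    osds = sorted(osds)
    half = (len(osds) + 1) // 2
    return {f"{host}-a": osds[:half], f"{host}-b": osds[half:]}


# Hypothetical host with six OSDs.
buckets = split_host("host1", [0, 1, 2, 3, 4, 5])
print(buckets)  # {'host1-a': [0, 1, 2], 'host1-b': [3, 4, 5]}
```

With a normal host-wise rule picking one OSD per "host" bucket, a physical host would then receive at most 2 chunks of a PG, and possibly fewer, which is the "up to 2" behaviour. The sketch ignores OSD weights, so the two halves may not be balanced in capacity.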

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx