Re: Erasure Coding Pools and PG calculation - documentation

Here is my CRUSH map.  You can see our general setup.  We are using the bottom rule (default.rgw.buckets.data) for the EC pool.

We are trying to get to the point where we can lose an entire host and have the cluster continue to work.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class hdd
device 34 osd.34 class hdd
device 35 osd.35 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd01tv01 {
        id -3           # do not change unnecessarily
        id -4 class hdd         # do not change unnecessarily
        # weight 109.152
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 9.096
        item osd.3 weight 9.096
        item osd.6 weight 9.096
        item osd.9 weight 9.096
        item osd.12 weight 9.096
        item osd.15 weight 9.096
        item osd.18 weight 9.096
        item osd.21 weight 9.096
        item osd.24 weight 9.096
        item osd.27 weight 9.096
        item osd.30 weight 9.096
        item osd.33 weight 9.096
}
host osd02tv01 {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        # weight 109.152
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 9.096
        item osd.4 weight 9.096
        item osd.7 weight 9.096
        item osd.10 weight 9.096
        item osd.13 weight 9.096
        item osd.16 weight 9.096
        item osd.19 weight 9.096
        item osd.22 weight 9.096
        item osd.25 weight 9.096
        item osd.28 weight 9.096
        item osd.31 weight 9.096
        item osd.34 weight 9.096
}
host osd03tv01 {
        id -7           # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        # weight 109.152
        alg straw2
        hash 0  # rjenkins1
        item osd.2 weight 9.096
        item osd.5 weight 9.096
        item osd.8 weight 9.096
        item osd.11 weight 9.096
        item osd.14 weight 9.096
        item osd.17 weight 9.096
        item osd.20 weight 9.096
        item osd.23 weight 9.096
        item osd.26 weight 9.096
        item osd.29 weight 9.096
        item osd.32 weight 9.096
        item osd.35 weight 9.096
}
root default {
        id -1           # do not change unnecessarily
        id -2 class hdd         # do not change unnecessarily
        # weight 327.441
        alg straw2
        hash 0  # rjenkins1
        item osd01tv01 weight 109.147
        item osd02tv01 weight 109.147
        item osd03tv01 weight 109.147
}

# rules
rule replicated_rule {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule default.rgw.buckets.data {
        id 1
        type erasure
        min_size 3
        max_size 3
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 2 type host
        step choose indep 2 type osd
        step emit
}

# end crush map
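
For anyone checking a rule like the one above, crushtool can simulate what it actually maps to without touching the cluster.  The rule id 1 and --num-rep 4 below are only examples; --num-rep should match k+m for the EC profile attached to the pool:

# grab the live compiled map (or use crushtool -c to compile an edited text map)
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# simulate placements for rule id 1 and print the OSD mapping for each input
crushtool -i crushmap.bin --test --rule 1 --num-rep 4 --show-mappings

# summary view, flagging any inputs the rule could not fully map
crushtool -i crushmap.bin --test --rule 1 --num-rep 4 --show-statistics --show-bad-mappings

If no single host ever holds more than m chunks in those mappings, losing a whole host should still leave the data readable.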

Thanks again for all the help!

Tim Gipson
Systems Engineer

On 11/12/17, 10:57 PM, "Christian Wuerdig" <christian.wuerdig@xxxxxxxxx> wrote:

    Well, as stated in the other email, I think that in the EC scenario you
    can set size = k+m in the pgcalc tool. If you want 10+2, then in theory
    you should be able to get away with 6 nodes and still survive a single
    node failure, provided you can guarantee that every node always receives
    2 of the 12 chunks - that looks achievable:
    http://ceph.com/planet/erasure-code-on-small-clusters/
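
For reference, following the pattern in that article and in the EC rule from my map above, a 10+2 layout spread 2 chunks per host across 6 hosts could look roughly like this (the rule name, id and profile/pool names below are just placeholders, untested):

rule ec_k10_m2_by_host {
        id 2
        type erasure
        min_size 12
        max_size 12
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        # pick 6 distinct hosts, then 2 OSDs inside each host = 12 chunks,
        # so losing any single host costs at most m=2 chunks
        step choose indep 6 type host
        step choose indep 2 type osd
        step emit
}

# profile and pool creation would then be along these lines
ceph osd erasure-code-profile set ec-10-2 k=10 m=2
ceph osd pool create ecpool.test 256 256 erasure ec-10-2 ec_k10_m2_by_host

On the PG side, plugging size = k+m into the usual pgcalc formula (and ignoring the %data factor) for 6 hosts of 12 OSDs with a target of ~100 PGs per OSD gives 72 * 100 / 12 = 600, which rounds to 512.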
    
    On Mon, Nov 13, 2017 at 1:32 PM, Tim Gipson <tgipson@xxxxxxx> wrote:
    > I guess my questions are more centered around k+m and PG calculations.
    >
    > As we started building and testing our EC pools, we were trying to figure out what our calculations needed to be, starting with 3 OSD hosts with 12 x 10 TB OSDs apiece.  The nodes can expand to 24 drives apiece, and we hope to eventually get to around a 1 PB cluster after we add some more hosts.  Initially we hoped to do k=10, m=2 on the pool, but I am not sure that is going to be feasible.  We’d like to set up the failure domain so that we would be able to lose an entire host without losing the cluster.  At this point I’m not sure that’s possible without bringing in more hosts.
    >
    > Thanks for the help!
    >
    > Tim Gipson
    >
    >
    > On 11/12/17, 5:14 PM, "Christian Wuerdig" <christian.wuerdig@xxxxxxxxx> wrote:
    >
    >     I might be wrong, but from memory I think you can use
    >     http://ceph.com/pgcalc/ and use k+m for the size
    >
    >     On Sun, Nov 12, 2017 at 5:41 AM, Ashley Merrick <ashley@xxxxxxxxxxxxxx> wrote:
    >     > Hello,
    >     >
    >     > Are you having any issues with getting the pool working or just around the
    >     > PG num you should use?
    >     >
    >     > ,Ashley
    >     >
    >     >
    >     > ________________________________
    >     > From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Tim Gipson
    >     > <tgipson@xxxxxxx>
    >     > Sent: Saturday, November 11, 2017 5:38:02 AM
    >     > To: ceph-users@xxxxxxxxxxxxxx
    >     > Subject:  Erasure Coding Pools and PG calculation -
    >     > documentation
    >     >
    >     > Hey all,
    >     >
    >     > I’m having some trouble setting up a Pool for Erasure Coding.  I haven’t
    >     > found much documentation around the PG calculation for an Erasure Coding
    >     > pool.  It seems from what I’ve tried so far that the math needed to set one
    >     > up is different than the math you use to calculate PGs for a regular
    >     > replicated pool.
    >     >
    >     > Does anyone have any experience setting up a pool this way and can you give
    >     > me some help or direction, or point me toward some documentation that goes
    >     > over the math behind this sort of pool setup?
    >     >
    >     > Any help would be greatly appreciated!
    >     >
    >     > Thanks,
    >     >
    >     >
    >     > Tim Gipson
    >     > Systems Engineer
    >     >
    >     >
    >
    >



