Re: Questions on CRUSH map

On 08/20/2018 08:15 PM, Cody wrote:
Hi Konstantin,

Thank you for looking into my question.

I was trying to understand how to set up CRUSH hierarchies and set
rules for different failure domains. I am particularly confused by the
'step take' and 'step choose|chooseleaf' settings, which I think
are the keys to defining a failure domain in a CRUSH rule.

As for my hypothetical cluster, it is made of 3 racks with 2 hosts in
each. One host has 3 SSD-based OSDs and the other has 3 HDD-based
OSDs. I wish to create two rules: one that uses SSDs only and another
that uses HDDs only. Both rules should have a rack-level failure domain.

I have attached a diagram that may help to explain my setup. The
following is my CRUSH map configuration (with all typos fixed) for
review:

device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class hdd
device 16 osd.17 class hdd
device 17 osd.17 class hdd

  host a1-1 {
      id -1
      alg straw
      hash 0
      item osd.0 weight 1.00
      item osd.1 weight 1.00
      item osd.2 weight 1.00
  }

  host a1-2 {
      id -2
      alg straw
      hash 0
      item osd.3 weight 1.00
      item osd.4 weight 1.00
      item osd.5 weight 1.00
  }

  host a2-1 {
      id -3
      alg straw
      hash 0
      item osd.6 weight 1.00
      item osd.7 weight 1.00
      item osd.8 weight 1.00
  }

  host a2-2 {
      id -4
      alg straw
      hash 0
      item osd.9 weight 1.00
      item osd.10 weight 1.00
      item osd.11 weight 1.00
  }

  host a3-1 {
      id -5
      alg straw
      hash 0
      item osd.12 weight 1.00
      item osd.13 weight 1.00
      item osd.14 weight 1.00
  }

  host a3-2 {
      id -6
      alg straw
      hash 0
      item osd.15 weight 1.00
      item osd.16 weight 1.00
      item osd.17 weight 1.00
  }

  rack a1 {
      id -7
      alg straw
      hash 0
      item a1-1 weight 3.0
      item a1-2 weight 3.0
  }

  rack a2 {
      id -5
      alg straw
      hash 0
      item a2-1 weight 3.0
      item a2-2 weight 3.0
  }

  rack a3 {
      id -6
      alg straw
      hash 0
      item a3-1 weight 3.0
      item a3-2 weight 3.0
  }

  row a {
      id -7
      alg straw
      hash 0
      item a1 6.0
      item a2 6.0
      item a3 6.0
  }

  rule ssd {
      id 1
      type replicated
      min_size 2
      max_size 11
      step take a class ssd
      step chooseleaf firstn 0 type rack
      step emit
  }

  rule hdd {
      id 2
      type replicated
      min_size 2
      max_size 11
      step take a class hdd
      step chooseleaf firstn 0 type rack
      step emit
  }


Are the two rules correct?



The days when you needed to manually edit the CRUSH map are gone. Manual editing, even in your case, has already led to errors: in the map above, osd.17 appears twice in the device list while osd.16 is missing, and the bucket IDs -5, -6 and -7 are each used twice.



# create new datacenter and move it to default root
ceph osd crush add-bucket new_datacenter datacenter
ceph osd crush move new_datacenter root=default
# create our racks
ceph osd crush add-bucket rack_a1 rack
ceph osd crush add-bucket rack_a2 rack
ceph osd crush add-bucket rack_a3 rack
# move our racks to our datacenter
ceph osd crush move rack_a1 datacenter=new_datacenter
ceph osd crush move rack_a2 datacenter=new_datacenter
ceph osd crush move rack_a3 datacenter=new_datacenter
# create our hosts
ceph osd crush add-bucket host_a1-1 host
ceph osd crush add-bucket host_a1-2 host
ceph osd crush add-bucket host_a2-1 host
ceph osd crush add-bucket host_a2-2 host
ceph osd crush add-bucket host_a3-1 host
ceph osd crush add-bucket host_a3-2 host
# and move them to their racks
ceph osd crush move host_a1-1 rack=rack_a1
ceph osd crush move host_a1-2 rack=rack_a1
ceph osd crush move host_a2-1 rack=rack_a2
ceph osd crush move host_a2-2 rack=rack_a2
ceph osd crush move host_a3-1 rack=rack_a3
ceph osd crush move host_a3-2 rack=rack_a3
# now it's time to deploy the OSDs. Once the OSDs are 'up' and 'in' and the proper class
# is assigned, we can move them to their hosts. If the class is wrong, e.g.
# 'nvme' is detected as 'ssd', we can rewrite the device class like this:
ceph osd crush rm-device-class osd.5
ceph osd crush set-device-class nvme osd.5
# okay, `ceph osd tree` shows our OSDs with their device classes; move them to their hosts:
ceph osd crush move osd.0 host=host_a1-1
ceph osd crush move osd.1 host=host_a1-1
ceph osd crush move osd.2 host=host_a1-1
ceph osd crush move osd.3 host=host_a1-2
ceph osd crush move osd.4 host=host_a1-2
ceph osd crush move osd.5 host=host_a1-2
# ... (repeat for the remaining OSDs)
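# optionally, sanity-check the assigned device classes before moving on:
ceph osd crush class ls
ceph osd crush class ls-osd ssd
ceph osd crush class ls-osd hdd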
# when this is done we should reweight the OSDs in the CRUSH map
# ssd drives are 960 GB
ceph osd crush reweight osd.0 0.960
ceph osd crush reweight osd.1 0.960
ceph osd crush reweight osd.2 0.960
# hdd drives are 6 TB
ceph osd crush reweight osd.3 5.5
ceph osd crush reweight osd.4 5.5
ceph osd crush reweight osd.5 5.5
# ... (repeat for the remaining OSDs)
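# alternatively, if all OSDs under a host are the same size, the whole subtree
# can be reweighted in one go, e.g.:
ceph osd crush reweight-subtree host_a1-1 0.960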
# the CRUSH map is ready, now it's time for the CRUSH rules
## New replication rules with device classes
ceph osd crush rule create-replicated replicated_racks_hdd default rack hdd
ceph osd crush rule create-replicated replicated_racks_ssd default rack ssd
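# to see the 'step take'/'step chooseleaf' steps you asked about, dump the
# generated rule and the per-class shadow trees:
ceph osd crush rule dump replicated_racks_ssd
ceph osd crush tree --show-shadow
# the ssd rule should roughly read: take default class ssd,
# chooseleaf firstn 0 type rack, emit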
# create new pool with predefined crush rule
ceph osd pool create replicated_rbd_hdd 128 128 replicated replicated_racks_hdd
# our failure domain is rack
ceph osd pool set replicated_rbd_hdd min_size 2
ceph osd pool set replicated_rbd_hdd size 3
ceph osd pool application enable replicated_rbd_hdd rbd
# or assign crush rule to existing pool
ceph osd pool set replicated_rbd_ssd crush_rule replicated_racks_ssd
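# quick check that each pool picked up the intended rule:
ceph osd pool get replicated_rbd_hdd crush_rule
ceph osd pool get replicated_rbd_ssd crush_rule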



As you can see, when you use the CLI your values are validated before they are applied, which helps avoid human mistakes. All changes can be inspected live: if something looks wrong, you have easy access to 'ceph osd tree', 'ceph osd pool ls detail' and 'ceph osd crush rule dump'. I hope this helps newcomers understand CRUSH.
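If you ever want to see the full map these commands produced, or compare it with a hand-written one, you can export and decompile it; crushtool can also simulate a rule to confirm that the three replicas land in three different racks. A minimal sketch (the id passed to --rule is the 'rule_id' reported by 'ceph osd crush rule dump'):

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 3 --show-mappings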





k
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
