Re: Questions on CRUSH map

On 08/20/2018 08:15 PM, Cody wrote:
Hi Konstantin,

Thank you for looking into my question.

I was trying to understand how to set up CRUSH hierarchies and set
rules for different failure domains. I am particularly confused by the
'step take' and 'step choose|chooseleaf' settings, which I think
are the keys to defining a failure domain in a CRUSH rule.

As for my hypothetical cluster, it is made of 3 racks with 2 hosts in
each. One host has 3 SSD-based OSDs and the other has 3 HDD-based
OSDs. I wish to create two rules: one that uses SSDs only and another
that uses HDDs only. Both rules should have a rack-level failure domain.

I have attached a diagram that may help to explain my setup. The
following is my CRUSH map configuration (with all typos fixed) for
review:

device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class hdd
device 16 osd.17 class hdd
device 17 osd.17 class hdd

  host a1-1 {
      id -1
      alg straw
      hash 0
      item osd.0 weight 1.00
      item osd.1 weight 1.00
      item osd.2 weight 1.00
  }

  host a1-2 {
      id -2
      alg straw
      hash 0
      item osd.3 weight 1.00
      item osd.4 weight 1.00
      item osd.5 weight 1.00
  }

  host a2-1 {
      id -3
      alg straw
      hash 0
      item osd.6 weight 1.00
      item osd.7 weight 1.00
      item osd.8 weight 1.00
  }

  host a2-2 {
      id -4
      alg straw
      hash 0
      item osd.9 weight 1.00
      item osd.10 weight 1.00
      item osd.11 weight 1.00
  }

  host a3-1 {
      id -5
      alg straw
      hash 0
      item osd.12 weight 1.00
      item osd.13 weight 1.00
      item osd.14 weight 1.00
  }

  host a3-2 {
      id -6
      alg straw
      hash 0
      item osd.15 weight 1.00
      item osd.16 weight 1.00
      item osd.17 weight 1.00
  }

  rack a1 {
      id -7
      alg straw
      hash 0
      item a1-1 weight 3.0
      item a1-2 weight 3.0
  }

  rack a2 {
      id -5
      alg straw
      hash 0
      item a2-1 weight 3.0
      item a2-2 weight 3.0
  }

  rack a3 {
      id -6
      alg straw
      hash 0
      item a3-1 weight 3.0
      item a3-2 weight 3.0
  }

  row a {
      id -7
      alg straw
      hash 0
      item a1 6.0
      item a2 6.0
      item a3 6.0
  }

  rule ssd {
      id 1
      type replicated
      min_size 2
      max_size 11
      step take a class ssd
      step chooseleaf firstn 0 type rack
      step emit
  }

  rule hdd {
      id 2
      type replicated
      min_size 2
      max_size 11
      step take a class hdd
      step chooseleaf firstn 0 type rack
      step emit
  }


Are the two rules correct?



The days when you needed to manually edit the CRUSH map are gone. Manual editing, even in your case, has already led to errors: in the map above, osd.17 appears twice in the device list while osd.16 is missing, and the bucket IDs -5, -6 and -7 are each used twice.



# create new datacenter and move it to default root
ceph osd crush add-bucket new_datacenter datacenter
ceph osd crush move new_datacenter root=default
# create our racks
ceph osd crush add-bucket rack_a1 rack
ceph osd crush add-bucket rack_a2 rack
ceph osd crush add-bucket rack_a3 rack
# move our racks to our datacenter
ceph osd crush move rack_a1 datacenter=new_datacenter
ceph osd crush move rack_a2 datacenter=new_datacenter
ceph osd crush move rack_a3 datacenter=new_datacenter
# create our hosts
ceph osd crush add-bucket host_a1-1 host
ceph osd crush add-bucket host_a1-2 host
ceph osd crush add-bucket host_a2-1 host
ceph osd crush add-bucket host_a2-2 host
ceph osd crush add-bucket host_a3-1 host
ceph osd crush add-bucket host_a3-2 host
# and move them to their racks
ceph osd crush move host_a1-1 rack=rack_a1
ceph osd crush move host_a1-2 rack=rack_a1
ceph osd crush move host_a2-1 rack=rack_a2
ceph osd crush move host_a2-2 rack=rack_a2
ceph osd crush move host_a3-1 rack=rack_a3
ceph osd crush move host_a3-2 rack=rack_a3
# now it's time to deploy the OSDs. Once the OSDs are 'up' and 'in' and the proper class
# is assigned, we can move them to their hosts. If the class is wrong, e.g.
# 'nvme' is detected as 'ssd', we can rewrite the device class like this:
ceph osd crush rm-device-class osd.5
ceph osd crush set-device-class nvme osd.5
# okay, `ceph osd tree` shows our OSDs with their device classes; move them to their hosts:
ceph osd crush move osd.0 host=host_a1-1
ceph osd crush move osd.1 host=host_a1-1
ceph osd crush move osd.2 host=host_a1-1
ceph osd crush move osd.3 host=host_a1-2
ceph osd crush move osd.4 host=host_a1-2
ceph osd crush move osd.5 host=host_a1-2
# ... (repeat for the remaining OSDs)
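# optionally, sanity-check the assigned device classes before moving on:
ceph osd crush class ls
ceph osd crush class ls-osd ssd
ceph osd crush class ls-osd hdd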
# when this is done we should reweight the OSDs in the CRUSH map
# ssd drives are 960 GB
ceph osd crush reweight osd.0 0.960
ceph osd crush reweight osd.1 0.960
ceph osd crush reweight osd.2 0.960
# hdd drives are 6 TB
ceph osd crush reweight osd.3 5.5
ceph osd crush reweight osd.4 5.5
ceph osd crush reweight osd.5 5.5
# ... (repeat for the remaining OSDs)
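# alternatively, if all OSDs under a host are the same size, the whole subtree
# can be reweighted in one go, e.g.:
ceph osd crush reweight-subtree host_a1-1 0.960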
# the CRUSH map is ready, now it's time for the CRUSH rules
## New replication rules with device classes
ceph osd crush rule create-replicated replicated_racks_hdd default rack hdd
ceph osd crush rule create-replicated replicated_racks_ssd default rack ssd
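# to see the 'step take'/'step chooseleaf' steps you asked about, dump the
# generated rule and the per-class shadow trees:
ceph osd crush rule dump replicated_racks_ssd
ceph osd crush tree --show-shadow
# the ssd rule should roughly read: take default class ssd,
# chooseleaf firstn 0 type rack, emit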
# create new pool with predefined crush rule
ceph osd pool create replicated_rbd_hdd 128 128 replicated replicated_racks_hdd
# our failure domain is rack
ceph osd pool set replicated_rbd_hdd min_size 2
ceph osd pool set replicated_rbd_hdd size 3
ceph osd pool application enable replicated_rbd_hdd rbd
# or assign crush rule to existing pool
ceph osd pool set replicated_rbd_ssd crush_rule replicated_racks_ssd
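# quick check that each pool picked up the intended rule:
ceph osd pool get replicated_rbd_hdd crush_rule
ceph osd pool get replicated_rbd_ssd crush_rule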



As you can see, when you use the CLI your values are validated before they are applied, which helps avoid human mistakes. All changes can be inspected live: if something looks wrong, you have easy access to 'ceph osd tree', 'ceph osd pool ls detail' and 'ceph osd crush rule dump'. I hope this helps newcomers understand CRUSH.
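If you ever want to see the full map these commands produced, or compare it with a hand-written one, you can export and decompile it; crushtool can also simulate a rule to confirm that the three replicas land in three different racks. A minimal sketch (the id passed to --rule is the 'rule_id' reported by 'ceph osd crush rule dump'):

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
crushtool -i /tmp/crushmap.bin --test --rule 1 --num-rep 3 --show-mappings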





k
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
