Re: Questions on CRUSH map

Hi Konstantin,

I could only dream of reading this answer! Thank you so much!!!

Regards,
Cody


On Tue, Aug 21, 2018 at 8:50 AM Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>
> On 08/20/2018 08:15 PM, Cody wrote:
>
> Hi Konstantin,
>
> Thank you for looking into my question.
>
> I was trying to understand how to set up CRUSH hierarchies and set
> rules for different failure domains. I am particularly confused by the
> 'step take' and 'step choose|chooseleaf' settings, which I think are
> the keys to defining a failure domain in a CRUSH rule (my reading of
> those two steps is spelled out below, after the rules).
>
> As for my hypothetical cluster, it is made up of 3 racks with 2 hosts in
> each. One host has 3 SSD-based OSDs and the other has 3 HDD-based OSDs.
> I would like to create two rules: one using SSDs only and the other
> HDDs only. Both rules should have a rack-level failure domain.
>
> I have attached a diagram that may help to explain my setup. The
> following is my CRUSH map configuration (with all typos fixed) for
> review:
>
> device 0 osd.0 class ssd
> device 1 osd.1 class ssd
> device 2 osd.2 class ssd
> device 3 osd.3 class hdd
> device 4 osd.4 class hdd
> device 5 osd.5 class hdd
> device 6 osd.6 class ssd
> device 7 osd.7 class ssd
> device 8 osd.8 class ssd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class ssd
> device 13 osd.13 class ssd
> device 14 osd.14 class ssd
> device 15 osd.15 class hdd
> device 16 osd.17 class hdd
> device 17 osd.17 class hdd
>
>   host a1-1 {
>       id -1
>       alg straw
>       hash 0
>       item osd.0 weight 1.00
>       item osd.1 weight 1.00
>       item osd.2 weight 1.00
>   }
>
>   host a1-2 {
>       id -2
>       alg straw
>       hash 0
>       item osd.3 weight 1.00
>       item osd.4 weight 1.00
>       item osd.5 weight 1.00
>   }
>
>   host a2-1 {
>       id -3
>       alg straw
>       hash 0
>       item osd.6 weight 1.00
>       item osd.7 weight 1.00
>       item osd.8 weight 1.00
>   }
>
>   host a2-2 {
>       id -4
>       alg straw
>       hash 0
>       item osd.9 weight 1.00
>       item osd.10 weight 1.00
>       item osd.11 weight 1.00
>   }
>
>   host a3-1 {
>       id -5
>       alg straw
>       hash 0
>       item osd.12 weight 1.00
>       item osd.13 weight 1.00
>       item osd.14 weight 1.00
>   }
>
>   host a3-2 {
>       id -6
>       alg straw
>       hash 0
>       item osd.15 weight 1.00
>       item osd.16 weight 1.00
>       item osd.17 weight 1.00
>   }
>
>   rack a1 {
>       id -7
>       alg straw
>       hash 0
>       item a1-1 weight 3.0
>       item a1-2 weight 3.0
>   }
>
>   rack a2 {
>       id -5
>       alg straw
>       hash 0
>       item a2-1 weight 3.0
>       item a2-2 weight 3.0
>   }
>
>   rack a3 {
>       id -6
>       alg straw
>       hash 0
>       item a3-1 weight 3.0
>       item a3-2 weight 3.0
>   }
>
>   row a {
>       id -7
>       alg straw
>       hash 0
>       item a1 6.0
>       item a2 6.0
>       item a3 6.0
>   }
>
>   rule ssd {
>       id 1
>       type replicated
>       min_size 2
>       max_size 11
>       step take a class ssd
>       step chooseleaf firstn 0 type rack
>       step emit
>   }
>
>   rule hdd {
>       id 2
>       type replicated
>       min_size 2
>       max_size 11
>       step take a class hdd
>       step chooseleaf firstn 0 type rack
>       step emit
>   }
>
>
> Are the two rules correct?
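>
> In particular, here is my current reading of the two key steps in the SSD
> rule, written out with inline comments (my own interpretation, which may
> well be wrong):
>
>   rule ssd {
>       id 1
>       type replicated
>       min_size 2
>       max_size 11
>       # start the selection at bucket 'a' (the row), restricted to its
>       # ssd device-class shadow tree
>       step take a class ssd
>       # pick as many distinct rack buckets as the pool size asks for
>       # (firstn 0) and descend to one OSD (leaf) under each, which is
>       # what should make rack the failure domain
>       step chooseleaf firstn 0 type rack
>       # output the selected OSDs
>       step emit
>   }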
>
>
>
> The days when you needed to edit the CRUSH map manually are gone. Manual editing, even in your case, has already led to errors.
>
>
>
> # create new datacenter and move it to default root
> ceph osd crush add-bucket new_datacenter datacenter
> ceph osd crush move new_datacenter root=default
> # create our racks
> ceph osd crush add-bucket rack_a1 rack
> ceph osd crush add-bucket rack_a2 rack
> ceph osd crush add-bucket rack_a3 rack
> # move our racks to our datacenter
> ceph osd crush move rack_a1 datacenter=new_datacenter
> ceph osd crush move rack_a2 datacenter=new_datacenter
> ceph osd crush move rack_a3 datacenter=new_datacenter
> # create our hosts
> ceph osd crush add-bucket host_a1-1 host
> ceph osd crush add-bucket host_a1-2 host
> ceph osd crush add-bucket host_a2-1 host
> ceph osd crush add-bucket host_a2-2 host
> ceph osd crush add-bucket host_a3-1 host
> ceph osd crush add-bucket host_a3-2 host
> # and move them to their racks
> ceph osd crush move host_a1-1 rack=rack_a1
> ceph osd crush move host_a1-2 rack=rack_a1
> ceph osd crush move host_a2-1 rack=rack_a2
> ceph osd crush move host_a2-2 rack=rack_a2
> ceph osd crush move host_a3-1 rack=rack_a3
> ceph osd crush move host_a3-2 rack=rack_a3
> # now it's time to deploy osds. once the osds are 'up' and 'in' and the proper
> # class is assigned, we can move them to their hosts. if a class is wrong, e.g.
> # 'nvme' is detected as 'ssd', we can rewrite the device class like this:
> ceph osd crush rm-device-class osd.5
> ceph osd crush set-device-class nvme osd.5
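> # (not part of the original walkthrough) the device classes currently known to
> # the crush map can be double-checked at this point with:
> ceph osd crush class ls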
> # okay, `ceph osd tree` now shows our osds with their device classes; move them to their hosts:
> ceph osd crush move osd.0 host=host_a1-1
> ceph osd crush move osd.1 host=host_a1-1
> ceph osd crush move osd.2 host=host_a1-1
> ceph osd crush move osd.3 host=host_a1-2
> ceph osd crush move osd.4 host=host_a1-2
> ceph osd crush move osd.5 host=host_a1-2
> <etc>...
> # when this is done we should reweight the osds in the crush map
> # the ssd drives are 960 GB
> ceph osd crush reweight osd.0 0.960
> ceph osd crush reweight osd.1 0.960
> ceph osd crush reweight osd.2 0.960
> # the hdd drives are 6 TB
> ceph osd crush reweight osd.3 5.5
> ceph osd crush reweight osd.4 5.5
> ceph osd crush reweight osd.5 5.5
> <etc>...
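> # (side note, not part of the original walkthrough) by convention a crush
> # weight of 1.0 corresponds to roughly 1 TiB of raw capacity, which is why a
> # 6 TB drive ends up at about 5.5; the resulting weights can be reviewed with:
> ceph osd df tree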
> # the crush map is ready, now it's time for the crush rules
> ## New replication rules with device classes
> ceph osd crush rule create-replicated replicated_racks_hdd default rack hdd
> ceph osd crush rule create-replicated replicated_racks_ssd default rack ssd
> # create new pool with predefined crush rule
> ceph osd pool create replicated_rbd_hdd 128 128 replicated replicated_racks_hdd
> # our failure domain is rack
> ceph osd pool set replicated_rbd_hdd min_size 2
> ceph osd pool set replicated_rbd_hdd size 3
> ceph osd pool application enable replicated_rbd_hdd rbd
> # or assign crush rule to existing pool
> ceph osd pool set replicated_rbd_ssd crush_rule replicated_racks_ssd
>
>
>
> As you can see, when you use the CLI your values are validated before they are applied, which helps avoid human mistakes. You can follow all changes online, and if something looks wrong you have easy access to 'ceph osd tree', 'ceph osd pool ls detail' and 'ceph osd crush rule dump'. I hope this will help novices understand CRUSH.
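>
> A quick post-change sanity check could look like this (the rule and pool names are the ones created above):
>
> # confirm the hierarchy, device classes and crush weights
> ceph osd tree
> # confirm which crush_rule each pool is using
> ceph osd pool ls detail
> # inspect the steps that were generated for a rule
> ceph osd crush rule dump replicated_racks_hdd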
>
>
>
>
>
> k
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


