Re: CRUSH Rule Review - Not replicating correctly

Thanks Robert. Will definitely try this. Is there a way to implement “gradual” CRUSH changes? I've noticed that whenever cluster-wide changes are pushed (a new crush map, for instance), the cluster immediately attempts to rebalance itself, disrupting client access and performance.
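
For reference, the only knobs I'm aware of for softening that impact are the generic recovery/backfill throttles rather than anything CRUSH-specific, e.g. something along these lines (illustrative values, not tuned for this cluster):

# slow backfill/recovery down so client I/O keeps priority
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# or pause data movement entirely while making several map edits, then unset afterwards
ceph osd set nobackfill
ceph osd set norecover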


> On Jan 18, 2016, at 12:22 , Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> 
> I'm not sure why you have six monitors. Six monitors buy you nothing
> over five other than more power use, more latency, and more headache:
> with six monitors a quorum needs four of them, with five it needs
> three, so both configurations survive losing only two monitors. See
> http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum
> for more info. Also, I'd consider five monitors overkill for a cluster
> of this size; I'd recommend three.
> 
> Although this is most likely not the root cause of your problem, you
> probably have an error here: “root replicated-T1” points directly at
> b02s08 and b02s12, while “site erbus” (via rack b02) also contains
> b02s08 and b02s12, so the same hosts sit under two separate parents.
> You probably meant to have “root replicated-T1” point at erbus instead.
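> 
> Something like this for the root bucket, reusing your existing ids and
> weights (just a sketch):
> 
> root replicated-T1 {
>     id -1                   # do not change unnecessarily
>     # weight 152.080
>     alg straw
>     hash 0                  # rjenkins1
>     item erbus weight 152.080
> }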
> 
> I think your problem is in your “rule replicated” section. You can try:
> step take replicated-T1
> step choose firstn 2 type host
> step chooseleaf firstn 2 type osdgroup
> step emit
> 
> What this does is choose two hosts from the root replicated-T1 (which
> happens to be both hosts you have), then choose one OSD from each of
> two osdgroups on each host, giving you two copies per host and four in
> total to match size=4.
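> 
> Dropped into your map, the whole “rule replicated” section would look
> something like this (keeping your current min_size/max_size):
> 
> rule replicated {
>     ruleset 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take replicated-T1
>     step choose firstn 2 type host
>     step chooseleaf firstn 2 type osdgroup
>     step emit
> }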
> 
> I believe the problem with your current rule is that 'step choose
> firstn 0 type host' tries to select four hosts (firstn 0 means "as many
> as the pool's size"), but only two are available. You should be able to
> see that with 'ceph pg dump', where only two OSDs will be listed in the
> up set of each PG.
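> 
> For example (illustrative; pool/PG ids will differ on your cluster):
> 
> # brief up/acting sets for every PG; count the OSDs listed per PG
> ceph pg dump pgs_brief
> # or inspect a single PG
> ceph pg map 0.1f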
> 
> I hope that helps.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sun, Jan 17, 2016 at 6:31 PM, deeepdish <deeepdish@xxxxxxxxx> wrote:
>> Hi Everyone,
>> 
>> Looking for a double-check of my logic and crush map.
>> 
>> Overview:
>> 
>> - The osdgroup bucket type defines a failure domain within a host: 5
>> OSDs sharing 1 SSD journal constitute one osdgroup bucket, and each
>> host has 4 osdgroups.
>> - 6 monitors
>> - Two-node cluster
>> - Each node has:
>>   - 20 OSDs
>>   - 4 SSDs
>>   - 4 osdgroups
>> 
>> Desired Crush Rule outcome:
>> - Assuming a pool with min_size=2 and size=4 (set as shown in the
>> commands below), each node would contain a redundant copy of each
>> object, i.e. two of the four replicas on each host. Should either host
>> fail, access to data would remain uninterrupted.
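>> 
>> For reference, the pool settings in question are applied like this
>> (with "rbd" standing in for the actual pool name):
>> 
>> ceph osd pool set rbd size 4
>> ceph osd pool set rbd min_size 2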
>> 
>> Current Crush Rule outcome:
>> - There are 4 copies of each object; however, I don't believe each
>> node has a redundant copy of every object, because when a node fails,
>> data is NOT accessible until Ceph rebuilds itself or the node becomes
>> accessible again.
>> 
>> I suspect my crush map is not right, and remedying it may take some
>> time and leave the cluster unresponsive / unavailable. Is there a way /
>> method to apply substantial crush changes to a cluster gradually?
>> 
>> Thanks for your help.
>> 
>> 
>> Current crush map:
>> 
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>> 
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>> device 23 osd.23
>> device 24 osd.24
>> device 25 osd.25
>> device 26 osd.26
>> device 27 osd.27
>> device 28 osd.28
>> device 29 osd.29
>> device 30 osd.30
>> device 31 osd.31
>> device 32 osd.32
>> device 33 osd.33
>> device 34 osd.34
>> device 35 osd.35
>> device 36 osd.36
>> device 37 osd.37
>> device 38 osd.38
>> device 39 osd.39
>> 
>> # types
>> type 0 osd
>> type 1 osdgroup
>> type 2 host
>> type 3 rack
>> type 4 site
>> type 5 root
>> 
>> # buckets
>> osdgroup b02s08-osdgroupA {
>> id -81 # do not change unnecessarily
>> # weight 18.100
>> alg straw
>> hash 0 # rjenkins1
>> item osd.0 weight 3.620
>> item osd.1 weight 3.620
>> item osd.2 weight 3.620
>> item osd.3 weight 3.620
>> item osd.4 weight 3.620
>> }
>> osdgroup b02s08-osdgroupB {
>> id -82 # do not change unnecessarily
>> # weight 18.100
>> alg straw
>> hash 0 # rjenkins1
>> item osd.5 weight 3.620
>> item osd.6 weight 3.620
>> item osd.7 weight 3.620
>> item osd.8 weight 3.620
>> item osd.9 weight 3.620
>> }
>> osdgroup b02s08-osdgroupC {
>> id -83 # do not change unnecessarily
>> # weight 19.920
>> alg straw
>> hash 0 # rjenkins1
>> item osd.10 weight 3.620
>> item osd.11 weight 3.620
>> item osd.12 weight 3.620
>> item osd.13 weight 3.620
>> item osd.14 weight 5.440
>> }
>> osdgroup b02s08-osdgroupD {
>> id -84 # do not change unnecessarily
>> # weight 19.920
>> alg straw
>> hash 0 # rjenkins1
>> item osd.15 weight 3.620
>> item osd.16 weight 3.620
>> item osd.17 weight 3.620
>> item osd.18 weight 3.620
>> item osd.19 weight 5.440
>> }
>> host b02s08 {
>> id -80 # do not change unnecessarily
>> # weight 76.040
>> alg straw
>> hash 0 # rjenkins1
>> item b02s08-osdgroupA weight 18.100
>> item b02s08-osdgroupB weight 18.100
>> item b02s08-osdgroupC weight 19.920
>> item b02s08-osdgroupD weight 19.920
>> }
>> osdgroup b02s12-osdgroupA {
>> id -121 # do not change unnecessarily
>> # weight 18.100
>> alg straw
>> hash 0 # rjenkins1
>> item osd.20 weight 3.620
>> item osd.21 weight 3.620
>> item osd.22 weight 3.620
>> item osd.23 weight 3.620
>> item osd.24 weight 3.620
>> }
>> osdgroup b02s12-osdgroupB {
>> id -122 # do not change unnecessarily
>> # weight 18.100
>> alg straw
>> hash 0 # rjenkins1
>> item osd.25 weight 3.620
>> item osd.26 weight 3.620
>> item osd.27 weight 3.620
>> item osd.28 weight 3.620
>> item osd.29 weight 3.620
>> }
>> osdgroup b02s12-osdgroupC {
>> id -123 # do not change unnecessarily
>> # weight 19.920
>> alg straw
>> hash 0 # rjenkins1
>> item osd.30 weight 3.620
>> item osd.31 weight 3.620
>> item osd.32 weight 3.620
>> item osd.33 weight 3.620
>> item osd.34 weight 5.440
>> }
>> osdgroup b02s12-osdgroupD {
>> id -124 # do not change unnecessarily
>> # weight 19.920
>> alg straw
>> hash 0 # rjenkins1
>> item osd.35 weight 3.620
>> item osd.36 weight 3.620
>> item osd.37 weight 3.620
>> item osd.38 weight 3.620
>> item osd.39 weight 5.440
>> }
>> host b02s12 {
>> id -120 # do not change unnecessarily
>> # weight 76.040
>> alg straw
>> hash 0 # rjenkins1
>> item b02s12-osdgroupA weight 18.100
>> item b02s12-osdgroupB weight 18.100
>> item b02s12-osdgroupC weight 19.920
>> item b02s12-osdgroupD weight 19.920
>> }
>> root replicated-T1 {
>> id -1 # do not change unnecessarily
>> # weight 152.080
>> alg straw
>> hash 0 # rjenkins1
>> item b02s08 weight 76.040
>> item b02s12 weight 76.040
>> }
>> rack b02 {
>> id -20 # do not change unnecessarily
>> # weight 152.080
>> alg straw
>> hash 0 # rjenkins1
>> item b02s08 weight 76.040
>> item b02s12 weight 76.040
>> }
>> site erbus {
>> id -10 # do not change unnecessarily
>> # weight 152.080
>> alg straw
>> hash 0 # rjenkins1
>> item b02 weight 152.080
>> }
>> 
>> # rules
>> rule replicated {
>> ruleset 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take replicated-T1
>> step choose firstn 0 type host
>> step chooseleaf firstn 0 type osdgroup
>> step emit
>> }
>> 
>> # end crush map
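>> 
>> For reference, the usual cycle for extracting, editing, testing, and
>> re-injecting the map looks roughly like this (file names are just
>> examples):
>> 
>> ceph osd getcrushmap -o crushmap.bin
>> crushtool -d crushmap.bin -o crushmap.txt
>> # ... edit crushmap.txt ...
>> crushtool -c crushmap.txt -o crushmap.new
>> # dry-run the rule before injecting; each mapping should list 4 OSDs
>> crushtool -i crushmap.new --test --rule 0 --num-rep 4 --show-mappings
>> ceph osd setcrushmap -i crushmap.new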
>> 
>> 
>> 
>> 
>> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



