Hi Robert,
Just wanted to let you know that after applying your CRUSH suggestion and allowing the cluster to rebalance itself, I now have symmetrical data distribution. My rationale for keeping 5 monitors is availability. I have 3 compute nodes + 2 storage nodes, and I was thinking that making all of them monitors would provide additional redundancy. Based on your earlier comments, can you provide guidance on how much latency is induced by deploying excess monitors?
Thanks.
Not that I know of.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Jan 18, 2016 at 10:33 AM, deeepdish wrote:
Thanks Robert. Will definitely try this. Is there a way to implement "gradual CRUSH" changes? I noticed that whenever cluster-wide changes are pushed (a new crush map, for instance), the cluster immediately attempts to align itself, disrupting client access / performance...
On Jan 18, 2016, at 12:22 , Robert LeBlanc wrote:
I'm not sure why you have six monitors. Six monitors buys you nothing over five monitors other than more power consumption, more latency, and more headache. See http://docs.ceph.com/docs/hammer/rados/configuration/mon-config-ref/#monitor-quorum for some more info. Also, I'd consider five monitors overkill for a cluster of this size; I'd recommend three.
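If you do decide to trim the monitor count, it goes roughly along these lines (just a sketch; the monitor id "node5" below is a placeholder, not one of your hosts):

    # See which monitors exist and which are currently in quorum
    ceph mon stat
    ceph quorum_status --format json-pretty

    # Remove a surplus monitor from the monmap (placeholder id), then stop and
    # disable the ceph-mon daemon on that host so it doesn't try to rejoin
    ceph mon remove node5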
Although this is most likely not the root cause of your problem, you probably have an error here: "root replicated-T1" is pointing to b02s08 and b02s12 and "site erbus" is also pointing to b02s08 and b02s12. You probably meant to have "root replicated-T1" pointing to erbus instead.
Where I think your problem lies is in your "rule replicated" section. You can try:

step take replicated-T1
step choose firstn 2 type host
step chooseleaf firstn 2 type osdgroup
step emit
What this does is choose two hosts from the root replicated-T1 (which happens to be both hosts you have), then chooses an OSD from two osdgroups on each host.
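Combined with the header of your existing rule, the whole thing would look something like this (a sketch only, keeping your current ruleset number and min/max_size):

    rule replicated {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take replicated-T1
            step choose firstn 2 type host
            step chooseleaf firstn 2 type osdgroup
            step emit
    }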
I believe the problem with your current rule set is that firstn 0 type host tries to select four hosts, but only two are available. You should be able to see that with 'ceph pg dump', where only two osds will be listed in the up set.
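For example (pool and object names below are placeholders):

    # List every PG with its up and acting OSD sets
    ceph pg dump pgs_brief

    # Or check where a single object would be placed
    ceph osd map rbd some-object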
I hope that helps.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Sun, Jan 17, 2016 at 6:31 PM, deeepdish wrote:
Hi Everyone,
Looking for a double-check of my logic and crush map.
Overview:
- The osdgroup bucket type defines a failure domain within a host of 5 OSDs + 1 SSD. Therefore 5 OSDs (all utilizing the same journal) constitute an osdgroup bucket. Each host has 4 osdgroups.
- 6 monitors
- Two node cluster
- Each node:
  - 20 OSDs
  - 4 SSDs
  - 4 osdgroups
Desired Crush Rule outcome:
- Assuming a pool with min_size=2 and size=4, each node would contain a redundant copy of each object. Should either of the hosts fail, access to data would be uninterrupted.
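(For reference, by size/min_size I mean the usual pool settings, set along these lines; the pool name is a placeholder:)

    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2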
Current Crush Rule outcome:
- There are 4 copies of each object; however, I don't believe each node has a redundant copy of each object. When a node fails, data is NOT accessible until ceph rebuilds itself / the node becomes accessible again.
I suspect my crush map is not right, and remedying it may take some time and cause the cluster to be unresponsive / unavailable. Is there a way / method to apply substantial crush changes to a cluster gradually?
Thanks for your help.
Current crush map:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
# types
type 0 osd
type 1 osdgroup
type 2 host
type 3 rack
type 4 site
type 5 root
# buckets
osdgroup b02s08-osdgroupA {
        id -81          # do not change unnecessarily
        # weight 18.100
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 3.620
        item osd.1 weight 3.620
        item osd.2 weight 3.620
        item osd.3 weight 3.620
        item osd.4 weight 3.620
}
osdgroup b02s08-osdgroupB {
        id -82          # do not change unnecessarily
        # weight 18.100
        alg straw
        hash 0  # rjenkins1
        item osd.5 weight 3.620
        item osd.6 weight 3.620
        item osd.7 weight 3.620
        item osd.8 weight 3.620
        item osd.9 weight 3.620
}
osdgroup b02s08-osdgroupC {
        id -83          # do not change unnecessarily
        # weight 19.920
        alg straw
        hash 0  # rjenkins1
        item osd.10 weight 3.620
        item osd.11 weight 3.620
        item osd.12 weight 3.620
        item osd.13 weight 3.620
        item osd.14 weight 5.440
}
osdgroup b02s08-osdgroupD {
        id -84          # do not change unnecessarily
        # weight 19.920
        alg straw
        hash 0  # rjenkins1
        item osd.15 weight 3.620
        item osd.16 weight 3.620
        item osd.17 weight 3.620
        item osd.18 weight 3.620
        item osd.19 weight 5.440
}
host b02s08 {
        id -80          # do not change unnecessarily
        # weight 76.040
        alg straw
        hash 0  # rjenkins1
        item b02s08-osdgroupA weight 18.100
        item b02s08-osdgroupB weight 18.100
        item b02s08-osdgroupC weight 19.920
        item b02s08-osdgroupD weight 19.920
}
osdgroup b02s12-osdgroupA {
        id -121         # do not change unnecessarily
        # weight 18.100
        alg straw
        hash 0  # rjenkins1
        item osd.20 weight 3.620
        item osd.21 weight 3.620
        item osd.22 weight 3.620
        item osd.23 weight 3.620
        item osd.24 weight 3.620
}
osdgroup b02s12-osdgroupB {
        id -122         # do not change unnecessarily
        # weight 18.100
        alg straw
        hash 0  # rjenkins1
        item osd.25 weight 3.620
        item osd.26 weight 3.620
        item osd.27 weight 3.620
        item osd.28 weight 3.620
        item osd.29 weight 3.620
}
osdgroup b02s12-osdgroupC {
        id -123         # do not change unnecessarily
        # weight 19.920
        alg straw
        hash 0  # rjenkins1
        item osd.30 weight 3.620
        item osd.31 weight 3.620
        item osd.32 weight 3.620
        item osd.33 weight 3.620
        item osd.34 weight 5.440
}
osdgroup b02s12-osdgroupD {
        id -124         # do not change unnecessarily
        # weight 19.920
        alg straw
        hash 0  # rjenkins1
        item osd.35 weight 3.620
        item osd.36 weight 3.620
        item osd.37 weight 3.620
        item osd.38 weight 3.620
        item osd.39 weight 5.440
}
host b02s12 {
        id -120         # do not change unnecessarily
        # weight 76.040
        alg straw
        hash 0  # rjenkins1
        item b02s12-osdgroupA weight 18.100
        item b02s12-osdgroupB weight 18.100
        item b02s12-osdgroupC weight 19.920
        item b02s12-osdgroupD weight 19.920
}
root replicated-T1 {
        id -1           # do not change unnecessarily
        # weight 152.080
        alg straw
        hash 0  # rjenkins1
        item b02s08 weight 76.040
        item b02s12 weight 76.040
}
rack b02 {
        id -20          # do not change unnecessarily
        # weight 152.080
        alg straw
        hash 0  # rjenkins1
        item b02s08 weight 76.040
        item b02s12 weight 76.040
}
site erbus {
        id -10          # do not change unnecessarily
        # weight 152.080
        alg straw
        hash 0  # rjenkins1
        item b02 weight 152.080
}
# rules
rule replicated {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take replicated-T1
        step choose firstn 0 type host
        step chooseleaf firstn 0 type osdgroup
        step emit
}
# end crush map
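(Side note: to sanity-check rule changes offline before injecting them, my understanding is that crushtool can simulate placements roughly like this; file names below are placeholders:)

    # Grab and decompile the current map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # After editing, recompile and simulate 4-replica placements for rule 0
    crushtool -c crushmap.txt -o crushmap.new
    crushtool -i crushmap.new --test --rule 0 --num-rep 4 --show-mappings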