RE: Crushmap Design Question

Hi,
	Setting rep size to 3 only makes the data triple-replicated; it means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
	But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and that is indeed why your Ceph cluster failed.
	It looks to me like this constraint makes it hard to design a deployment that is robust against a DC outage. But I am hoping for input from the community on how to make the monitor cluster reliable.
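	For illustration, a minimal sketch of the quorum arithmetic described above (the function name is my own, not from Ceph):

```python
# Paxos-style quorum: a monitor cluster of n nodes only stays up
# while a strict majority of monitors is reachable.
def quorum_needed(n: int) -> int:
    """Minimum number of live monitors for a cluster of n nodes."""
    return n // 2 + 1

# A cluster of 2N+1 monitors tolerates the loss of at most N:
for n in (3, 5, 7):
    print(f"{n} mons: need {quorum_needed(n)} alive, "
          f"tolerate {n - quorum_needed(n)} down")
```

	So with 3 monitors, quorum survives one monitor failure but not two.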
					
																												Xiaoxi


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Moore, Shawn M
Sent: January 9, 2013 4:21
To: ceph-devel@xxxxxxxxxxxxxxx
Subject: Crushmap Design Question

I have been testing Ceph for a little over a month now.  Our design goal is to have 3 datacenters in different buildings, all tied together over 10GbE.  Currently there are 10 servers, each serving 1 OSD, in 2 of the datacenters.  In the third is one large server with 16 SAS disks serving 8 OSDs.  Eventually we will add one more identical large server to the third datacenter.  I have told Ceph to keep 3 copies and tried to design the crushmap so that, as long as a majority of mons can stay up, we could run off of one datacenter's worth of OSDs.  In my testing, however, it doesn't work out quite this way...

Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)

I will put hopefully relevant files at the end of this email.

When all 28 osds are up, I get:
2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

When I fail a datacenter (including 1 of 3 mon's) I eventually get:
2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
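As a sanity check on that figure (a quick calculation, not Ceph code): with 3x replication and one of three datacenters down, exactly one third of the object replicas are degraded:

```python
# Numbers from the pgmap line above: 16362 degraded out of 49086 replicas.
degraded, total = 16362, 49086

# One of three replicas lost -> exactly one third degraded.
assert degraded * 3 == total
print(f"{degraded / total:.3%}")  # matches the reported 33.333%
```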

At this point everything is still ok.  But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail

Most VMs quit working.  "rbd ls" still works, but "rados -p rbd ls" returns nothing and hangs.  After a while (you can see from the timestamps) the cluster ends up in the following state and stays there:
2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)

I'm hoping I've done something wrong, so please advise.  Below are my configs.  If you need something more to help, just ask.

Normal output with all datacenters up.
# ceph osd tree
# id	weight	type name	up/down	reweight
-1	80	root default
-3	36		datacenter hok
-2	1			host blade151
0	1				osd.0	up	1	
-4	1			host blade152
1	1				osd.1	up	1	
-15	1			host blade153
2	1				osd.2	up	1	
-17	1			host blade154
3	1				osd.3	up	1	
-18	1			host blade155
4	1				osd.4	up	1	
-19	1			host blade159
5	1				osd.5	up	1	
-20	1			host blade160
6	1				osd.6	up	1	
-21	1			host blade161
7	1				osd.7	up	1	
-22	1			host blade162
8	1				osd.8	up	1	
-23	1			host blade163
9	1				osd.9	up	1	
-24	36		datacenter csc
-5	1			host admbc0-01
10	1				osd.10	up	1	
-6	1			host admbc0-02
11	1				osd.11	up	1	
-7	1			host admbc0-03
12	1				osd.12	up	1	
-8	1			host admbc0-04
13	1				osd.13	up	1	
-9	1			host admbc0-05
14	1				osd.14	up	1	
-10	1			host admbc0-06
15	1				osd.15	up	1	
-11	1			host admbc0-09
16	1				osd.16	up	1	
-12	1			host admbc0-10
17	1				osd.17	up	1	
-13	1			host admbc0-11
18	1				osd.18	up	1	
-14	1			host admbc0-12
19	1				osd.19	up	1	
-25	8		datacenter adm
-16	8			host admdisk0
20	1				osd.20	up	1	
21	1				osd.21	up	1	
22	1				osd.22	up	1	
23	1				osd.23	up	1	
24	1				osd.24	up	1	
25	1				osd.25	up	1	
26	1				osd.26	up	1	
27	1				osd.27	up	1



Showing rep size set to 3.
# ceph osd dump | grep " size "
pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 63 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 65 owner 0
pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 6061 owner 0




Crushmap
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host blade151 {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host blade152 {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
host blade153 {
	id -15		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host blade154 {
	id -17		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
host blade155 {
	id -18		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 1.000
}
host blade159 {
	id -19		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 1.000
}
host blade160 {
	id -20		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 1.000
}
host blade161 {
	id -21		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.7 weight 1.000
}
host blade162 {
	id -22		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 1.000
}
host blade163 {
	id -23		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.9 weight 1.000
}
datacenter hok {
	id -3		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item blade151 weight 1.000
	item blade152 weight 1.000
	item blade153 weight 1.000
	item blade154 weight 1.000
	item blade155 weight 1.000
	item blade159 weight 1.000
	item blade160 weight 1.000
	item blade161 weight 1.000
	item blade162 weight 1.000
	item blade163 weight 1.000
}
host admbc0-01 {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
}
host admbc0-02 {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.11 weight 1.000
}
host admbc0-03 {
	id -7		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.12 weight 1.000
}
host admbc0-04 {
	id -8		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.13 weight 1.000
}
host admbc0-05 {
	id -9		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.14 weight 1.000
}
host admbc0-06 {
	id -10		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.15 weight 1.000
}
host admbc0-09 {
	id -11		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.16 weight 1.000
}
host admbc0-10 {
	id -12		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.17 weight 1.000
}
host admbc0-11 {
	id -13		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.18 weight 1.000
}
host admbc0-12 {
	id -14		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.19 weight 1.000
}
datacenter csc {
	id -24		# do not change unnecessarily
	# weight 10.000
	alg straw
	hash 0	# rjenkins1
	item admbc0-01 weight 1.000
	item admbc0-02 weight 1.000
	item admbc0-03 weight 1.000
	item admbc0-04 weight 1.000
	item admbc0-05 weight 1.000
	item admbc0-06 weight 1.000
	item admbc0-09 weight 1.000
	item admbc0-10 weight 1.000
	item admbc0-11 weight 1.000
	item admbc0-12 weight 1.000
}
host admdisk0 {
	id -16		# do not change unnecessarily
	# weight 8.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 1.000
	item osd.24 weight 1.000
	item osd.25 weight 1.000
	item osd.26 weight 1.000
	item osd.27 weight 1.000
}
datacenter adm {
	id -25		# do not change unnecessarily
	# weight 8.000
	alg straw
	hash 0	# rjenkins1
	item admdisk0 weight 8.000
}
root default {
	id -1		# do not change unnecessarily
	# weight 80.000
	alg straw
	hash 0	# rjenkins1
	item hok weight 36.000
	item csc weight 36.000
	item adm weight 8.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}

# end crush map
