Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
> Setting rep size to 3 only makes the data triple-replicated; that means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
> But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
> It looks to me like this requirement makes it hard to design a deployment that is robust against a DC outage, but I'm hoping for input from the community on how to make the monitor cluster reliable.
>

From what I understand he didn't kill the second mon, still leaving 2 out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over the 3 DCs? "ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the datacenters, so I don't know how that affects the situation.

Wido

> Xiaoxi
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Moore, Shawn M
> Sent: January 9, 2013 4:21
> To: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Crushmap Design Question
>
> I have been testing ceph for a little over a month now. Our design goal is to have 3 datacenters in different buildings, all tied together over 10GbE. Currently there are 10 servers, each serving 1 osd, in 2 of the datacenters. In the third is one large server with 16 SAS disks serving 8 osds. Eventually we will add one more identical large server to the third datacenter. I have told ceph to keep 3 copies and tried to design the crushmap in such a way that, as long as a majority of mons can stay up, we could run off of one datacenter's worth of osds. In my testing it doesn't work out quite this way...
>
> Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>
> I will put the hopefully relevant files at the end of this email.
>
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> When I fail a datacenter (including 1 of 3 mons) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
>
> At this point everything is still ok. But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> Most VMs quit working; "rbd ls" still works, but "rados -p rbd ls" does not return a single line and the command hangs. After a while (you can see from the timestamps) I end up here, and it stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)
>
> I'm hoping I've done something wrong, so please advise. Below are my configs. If you need something more to help, just ask.
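A quick note on the monitor math discussed above: with 3 monitors, quorum needs floor(3/2)+1 = 2, so losing one datacenter's monitor is tolerable while losing two is not. A minimal way to confirm which monitors are actually in quorum at each step, using standard ceph CLI commands (output details vary by release):

   ceph mon stat        # one-line summary of the monmap and current quorum
   ceph quorum_status   # JSON dump of the quorum members and election epoch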
> Normal output with all datacenters up:
>
> # ceph osd tree
> # id    weight  type name       up/down reweight
> -1      80      root default
> -3      36              datacenter hok
> -2      1                       host blade151
> 0       1                               osd.0   up      1
> -4      1                       host blade152
> 1       1                               osd.1   up      1
> -15     1                       host blade153
> 2       1                               osd.2   up      1
> -17     1                       host blade154
> 3       1                               osd.3   up      1
> -18     1                       host blade155
> 4       1                               osd.4   up      1
> -19     1                       host blade159
> 5       1                               osd.5   up      1
> -20     1                       host blade160
> 6       1                               osd.6   up      1
> -21     1                       host blade161
> 7       1                               osd.7   up      1
> -22     1                       host blade162
> 8       1                               osd.8   up      1
> -23     1                       host blade163
> 9       1                               osd.9   up      1
> -24     36              datacenter csc
> -5      1                       host admbc0-01
> 10      1                               osd.10  up      1
> -6      1                       host admbc0-02
> 11      1                               osd.11  up      1
> -7      1                       host admbc0-03
> 12      1                               osd.12  up      1
> -8      1                       host admbc0-04
> 13      1                               osd.13  up      1
> -9      1                       host admbc0-05
> 14      1                               osd.14  up      1
> -10     1                       host admbc0-06
> 15      1                               osd.15  up      1
> -11     1                       host admbc0-09
> 16      1                               osd.16  up      1
> -12     1                       host admbc0-10
> 17      1                               osd.17  up      1
> -13     1                       host admbc0-11
> 18      1                               osd.18  up      1
> -14     1                       host admbc0-12
> 19      1                               osd.19  up      1
> -25     8               datacenter adm
> -16     8                       host admdisk0
> 20      1                               osd.20  up      1
> 21      1                               osd.21  up      1
> 22      1                               osd.22  up      1
> 23      1                               osd.23  up      1
> 24      1                               osd.24  up      1
> 25      1                               osd.25  up      1
> 26      1                               osd.26  up      1
> 27      1                               osd.27  up      1
>
>
> Showing copies set to 3:
>
> # ceph osd dump | grep " size "
> pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 63 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 65 owner 0
> pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 6061 owner 0
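An aside on Wido's question above about whether the PGs really span all three datacenters: given the tree above (osd.0-9 in hok, osd.10-19 in csc, osd.20-27 in adm), every PG's up/acting set should contain one OSD from each of those ranges. A rough way to check on the live cluster; the pg id 2.1a below is only an example, pick any pg id from "ceph pg dump" (whose column layout varies between releases):

   # mapping for a single PG, e.g. one in pool 2 (rbd):
   ceph pg map 2.1a
   # or dump all PGs and scan the bracketed up/acting sets:
   ceph pg dump | less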
>
>
> Crushmap
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host blade151 {
>         id -2           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.000
> }
> host blade152 {
>         id -4           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 1.000
> }
> host blade153 {
>         id -15          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.2 weight 1.000
> }
> host blade154 {
>         id -17          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
> }
> host blade155 {
>         id -18          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.4 weight 1.000
> }
> host blade159 {
>         id -19          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.5 weight 1.000
> }
> host blade160 {
>         id -20          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.6 weight 1.000
> }
> host blade161 {
>         id -21          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.7 weight 1.000
> }
> host blade162 {
>         id -22          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.8 weight 1.000
> }
> host blade163 {
>         id -23          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.9 weight 1.000
> }
> datacenter hok {
>         id -3           # do not change unnecessarily
>         # weight 10.000
>         alg straw
>         hash 0  # rjenkins1
>         item blade151 weight 1.000
>         item blade152 weight 1.000
>         item blade153 weight 1.000
>         item blade154 weight 1.000
>         item blade155 weight 1.000
>         item blade159 weight 1.000
>         item blade160 weight 1.000
>         item blade161 weight 1.000
>         item blade162 weight 1.000
>         item blade163 weight 1.000
> }
> host admbc0-01 {
>         id -5           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.10 weight 1.000
> }
> host admbc0-02 {
>         id -6           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.11 weight 1.000
> }
> host admbc0-03 {
>         id -7           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.12 weight 1.000
> }
> host admbc0-04 {
>         id -8           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.13 weight 1.000
> }
> host admbc0-05 {
>         id -9           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.14 weight 1.000
> }
> host admbc0-06 {
>         id -10          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.15 weight 1.000
> }
> host admbc0-09 {
>         id -11          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.16 weight 1.000
> }
> host admbc0-10 {
>         id -12          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.17 weight 1.000
> }
> host admbc0-11 {
>         id -13          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.18 weight 1.000
> }
> host admbc0-12 {
>         id -14          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.19 weight 1.000
> }
> datacenter csc {
>         id -24          # do not change unnecessarily
>         # weight 10.000
>         alg straw
>         hash 0  # rjenkins1
>         item admbc0-01 weight 1.000
>         item admbc0-02 weight 1.000
>         item admbc0-03 weight 1.000
>         item admbc0-04 weight 1.000
>         item admbc0-05 weight 1.000
>         item admbc0-06 weight 1.000
>         item admbc0-09 weight 1.000
>         item admbc0-10 weight 1.000
>         item admbc0-11 weight 1.000
>         item admbc0-12 weight 1.000
> }
> host admdisk0 {
>         id -16          # do not change unnecessarily
>         # weight 8.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.20 weight 1.000
>         item osd.21 weight 1.000
>         item osd.22 weight 1.000
>         item osd.23 weight 1.000
>         item osd.24 weight 1.000
>         item osd.25 weight 1.000
>         item osd.26 weight 1.000
>         item osd.27 weight 1.000
> }
> datacenter adm {
>         id -25          # do not change unnecessarily
>         # weight 8.000
>         alg straw
>         hash 0  # rjenkins1
>         item admdisk0 weight 8.000
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 80.000
>         alg straw
>         hash 0  # rjenkins1
>         item hok weight 36.000
>         item csc weight 36.000
>         item adm weight 8.000
> }
>
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
>
> # end crush map
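For what it's worth, with rep size 3 and "step chooseleaf firstn 0 type datacenter", each PG should end up with exactly one OSD in each of hok, csc and adm, so failing two datacenters leaves only one reachable copy per PG. The placement can also be checked offline by replaying the rule with crushtool; a sketch, assuming the map is exported as shown (--show-mappings exists on newer crushtool builds and the exact flags may differ on 0.56.1):

   # export the crushmap the cluster is actually using and decompile it:
   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt

   # replay ruleset 2 (rbd) for 3 replicas and print the chosen OSD sets;
   # each result should contain one OSD from each datacenter:
   crushtool -i crush.bin --test --rule 2 --num-rep 3 --show-mappings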
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html