further crush map questions

We observe strange behavior with some configurations: PGs stay in a degraded state after
a single OSD failure.
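
For reference, a CRUSH map can be pulled from a running cluster and decompiled roughly like this (a sketch; the file names are just what I use locally):

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush-test.txt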

I can also show the behavior using crushtool with the following map:

----------crush map---------
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host prox-ceph-1 {
	id -2		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1
	item osd.1 weight 1
	item osd.2 weight 1
	item osd.3 weight 1
}
host prox-ceph-2 {
	id -3		# do not change unnecessarily
	# weight 7.260
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 1
	item osd.5 weight 1
	item osd.6 weight 1
	item osd.7 weight 1
}
host prox-ceph-3 {
	id -4		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 1
	item osd.9 weight 1
	item osd.10 weight 1
	item osd.11 weight 1
}

root default {
	id -1		# do not change unnecessarily
	# weight 21.780
	alg straw
	hash 0	# rjenkins1
	item prox-ceph-1 weight 4
	item prox-ceph-2 weight 4
	item prox-ceph-3 weight 4
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map
--------------------------------------
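
The text map above is compiled into a binary map with crushtool; a minimal sketch of that step, assuming the text file is named 'crush-test.txt':

# crushtool -c crush-test.txt -o crush-test.map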

After compiling that map to 'crush-test.map', we run:

# crushtool --test -i 'crush-test.map' --rule 0 --num-rep 3 --weight 11 0 --show-statistics

I set '--weight 11 0' to mark osd.11 as 'out'. The result is:

...
rule 0 (data) num_rep 3 result size == 2:	111/1024
rule 0 (data) num_rep 3 result size == 3:	913/1024

So 111 PGs end up in a degraded state. I would expect the data to be
redistributed to the remaining OSDs instead.
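
If it helps, the same test can be re-run with --show-bad-mappings to list the individual inputs that end up with fewer than three OSDs (a sketch; the output format may vary between crushtool versions):

# crushtool --test -i 'crush-test.map' --rule 0 --num-rep 3 --weight 11 0 --show-bad-mappings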

Can someone explain why that happens?


