Hi Greg,

I have a custom CRUSH map, which I attached below. My goal is to have two racks, with each rack acting as a failure domain. That means for the rbd pool, which I use with a replication level of two, I want one replica in one rack and the other replica in the other rack, so that I could lose a whole rack and still have all data available.

At the moment I have just shut down one host in one of the racks. I would expect the now-missing objects to get replicated from the other rack onto the remaining hosts in the first rack (the one where I shut down a host). But with my CRUSH map that doesn't happen, so I think my CRUSH map is not right.

-martin

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host store1 {
	id -5		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
	item osd.1 weight 1.000
	item osd.2 weight 1.000
	item osd.3 weight 1.000
}
host store3 {
	id -7		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.10 weight 1.000
	item osd.11 weight 1.000
	item osd.8 weight 1.000
	item osd.9 weight 1.000
}
host store4 {
	id -8		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.12 weight 1.000
	item osd.13 weight 1.000
	item osd.14 weight 1.000
	item osd.15 weight 1.000
}
host store5 {
	id -9		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.16 weight 1.000
	item osd.17 weight 1.000
	item osd.18 weight 1.000
	item osd.19 weight 1.000
}
host store6 {
	id -10		# do not change unnecessarily
	#
weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.20 weight 1.000
	item osd.21 weight 1.000
	item osd.22 weight 1.000
	item osd.23 weight 1.000
}
host store2 {
	id -6		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 1.000
	item osd.5 weight 1.000
	item osd.6 weight 1.000
	item osd.7 weight 1.000
}
rack rack1 {
	id -3		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store1 weight 4.000
	item store2 weight 4.000
	item store3 weight 4.000
}
rack rack2 {
	id -4		# do not change unnecessarily
	# weight 12.000
	alg straw
	hash 0	# rjenkins1
	item store4 weight 4.000
	item store5 weight 4.000
	item store6 weight 4.000
}
root default {
	id -1		# do not change unnecessarily
	# weight 24.000
	alg straw
	hash 0	# rjenkins1
	item rack1 weight 12.000
	item rack2 weight 12.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type rack
	step emit
}

# end crush map


On 28.03.2013 18:44, Gregory Farnum wrote:
> Looks like you either have a custom config, or have specified
> somewhere that OSDs shouldn't be marked out (i.e., setting the 'noout'
> flag). There can also be a bit of flux if your OSDs are reporting an
> unusual number of failures, but you'd have seen failure reports if
> that were going on.
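[Editorial note on the noout point Greg raises: the transition he describes is governed, to my understanding, by the monitor option `mon osd down out interval`, which sets how long an OSD may stay "down" before being marked "out" and recovery begins. A sketch of the relevant ceph.conf fragment — the option name and its 300-second default are taken from the Ceph configuration reference; the value shown is just the default, not a recommendation:]

```ini
[mon]
	; seconds a "down" OSD may remain before being marked "out"
	; and data reshuffling starts (documented default: 300, i.e. 5 minutes)
	mon osd down out interval = 300
```

[Setting the `noout` flag (`ceph osd set noout`) suppresses this transition entirely, which would match the symptoms below; `ceph osd unset noout` restores normal behaviour, and `ceph osd out <id>` marks a specific OSD out immediately without waiting for the timeout.]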
> -Greg
>
> On Thu, Mar 28, 2013 at 10:35 AM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
>> Hi Greg,
>>
>> /etc/init.d/ceph stop osd.1
>> === osd.1 ===
>> Stopping Ceph osd.1 on store1...kill 13413...done
>> root@store1:~# date -R
>> Thu, 28 Mar 2013 18:22:05 +0100
>> root@store1:~# ceph -s
>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>> 39/904 degraded (4.314%); recovering 15E o/s, 15EB/s; 1/24 in osds are down
>>    monmap e1: 3 mons at
>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>> election epoch 6, quorum 0,1,2 a,b,c
>>    osdmap e28: 24 osds: 23 up, 24 in
>>    pgmap v449: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>> MB data, 3800 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%);
>> recovering 15E o/s, 15EB/s
>>    mdsmap e1: 0/0/1 up
>>
>>
>> 10 mins later, still the same
>>
>> root@store1:~# date -R
>> Thu, 28 Mar 2013 18:32:24 +0100
>> root@store1:~# ceph -s
>>    health HEALTH_WARN 378 pgs degraded; 378 pgs stuck unclean; recovery
>> 39/904 degraded (4.314%); 1/24 in osds are down
>>    monmap e1: 3 mons at
>> {a=192.168.195.31:6789/0,b=192.168.195.33:6789/0,c=192.168.195.35:6789/0},
>> election epoch 6, quorum 0,1,2 a,b,c
>>    osdmap e28: 24 osds: 23 up, 24 in
>>    pgmap v454: 4800 pgs: 4422 active+clean, 378 active+degraded; 1800
>> MB data, 3780 MB used, 174 TB / 174 TB avail; 39/904 degraded (4.314%)
>>    mdsmap e1: 0/0/1 up
>>
>> root@store1:~#
>>
>>
>> -martin
>>
>> On 28.03.2013 16:38, Gregory Farnum wrote:
>>> This is the perfectly normal distinction between "down" and "out". The
>>> OSD has been marked down but there's a timeout period (default: 5
>>> minutes) before it's marked "out" and the data gets reshuffled (to
>>> avoid starting replication on a simple reboot, for instance).

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
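[Editorial note on the map at the top of the thread: the rbd rule's `step chooseleaf firstn 0 type rack` selects one leaf (an OSD under some host) in each distinct rack, so two replicas never share a rack — and when one host fails, CRUSH can retry onto the surviving hosts of the same rack. A toy sketch of that behaviour follows; this is NOT the real CRUSH algorithm (straw-bucket hashing is replaced by a simple md5-based pick, and `place`/`_pick` are made-up helpers), only the host/OSD layout is copied from the map above:]

```python
# Toy model of "step chooseleaf firstn 0 type rack" with size=2:
# pick one OSD from one host per rack. Not real CRUSH -- the straw
# bucket hash is replaced by a crude deterministic md5 pick.
import hashlib

RACKS = {
    "rack1": {"store1": [0, 1, 2, 3], "store2": [4, 5, 6, 7],
              "store3": [8, 9, 10, 11]},
    "rack2": {"store4": [12, 13, 14, 15], "store5": [16, 17, 18, 19],
              "store6": [20, 21, 22, 23]},
}

def _pick(seq, key):
    """Deterministically pick one element of seq based on a hash of key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return seq[h % len(seq)]

def place(pg, down_hosts=()):
    """Return one OSD id per rack for placement group `pg`.

    Hosts listed in down_hosts are skipped, mimicking CRUSH retrying
    within the same rack when a chosen host is unavailable.
    """
    acting = []
    for rack in sorted(RACKS):
        alive = {h: osds for h, osds in RACKS[rack].items()
                 if h not in down_hosts}
        host = _pick(sorted(alive), pg + "/" + rack)
        acting.append(_pick(alive[host], pg + "/" + host))
    return acting
```

[With all hosts up, every placement yields one OSD from rack1 (osd.0-11) and one from rack2 (osd.12-23); with store1 down, rack1's replica falls on store2 or store3 rather than the mapping going unfillable — which is why the degraded state in the thread pointed at the down/out timeout rather than the map itself.]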