Re: crushmap rule issue: choose vs. chooseleaf

On Thu, 2010-06-24 at 12:20 -0600, Sage Weil wrote:
> Hi Jim,
> 
> Okay, I fixed another bug and am now able to use your map without 
> problems.  The fix is pushed to the unstable branch in ceph.git.

Great, thanks!  I really appreciate you taking
a look so quickly.

> 
> I'm surprised we didn't run into this before.. it looks like it's been 
> broken for a while.  I'm adding a tracker item to set up some unit tests 
> for this stuff so we can avoid this sort of regression.. the crush code 
> should be really easy to check.

That sounds great.

I'm still having a little trouble, though.

My map works for me now, in the sense that I can mount
the file system from a client.

But when I try to write to it, vmstat on the server
shows a little burst of I/O, and then nothing.

The same ceph config with the default map works
great - vmstat on the server shows 200-300 MB/s.

FWIW, here's my custom map again, queried 
via ceph osd getcrushmap:

# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3

# types
type 0 device
type 1 disk
type 2 controller
type 3 host
type 4 root

# buckets
disk disk0 {
	id -1		# do not change unnecessarily
	alg uniform	# do not change bucket size (1) unnecessarily
	hash 0	# rjenkins1
	item device0 weight 1.000 pos 0
}
disk disk1 {
	id -2		# do not change unnecessarily
	alg uniform	# do not change bucket size (1) unnecessarily
	hash 0	# rjenkins1
	item device1 weight 1.000 pos 0
}
disk disk2 {
	id -3		# do not change unnecessarily
	alg uniform	# do not change bucket size (1) unnecessarily
	hash 0	# rjenkins1
	item device2 weight 1.000 pos 0
}
disk disk3 {
	id -4		# do not change unnecessarily
	alg uniform	# do not change bucket size (1) unnecessarily
	hash 0	# rjenkins1
	item device3 weight 1.000 pos 0
}
controller controller0 {
	id -5		# do not change unnecessarily
	alg uniform	# do not change bucket size (2) unnecessarily
	hash 0	# rjenkins1
	item disk0 weight 1.000 pos 0
	item disk1 weight 1.000 pos 1
}
controller controller1 {
	id -6		# do not change unnecessarily
	alg uniform	# do not change bucket size (2) unnecessarily
	hash 0	# rjenkins1
	item disk2 weight 1.000 pos 0
	item disk3 weight 1.000 pos 1
}
host host0 {
	id -7		# do not change unnecessarily
	alg uniform	# do not change bucket size (2) unnecessarily
	hash 0	# rjenkins1
	item controller0 weight 2.000 pos 0
	item controller1 weight 2.000 pos 1
}
root root {
	id -8		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item host0 weight 4.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 2
	max_size 2
	step take root
	step chooseleaf firstn 0 type controller
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 2
	max_size 2
	step take root
	step chooseleaf firstn 0 type controller
	step emit
}
rule casdata {
	ruleset 2
	type replicated
	min_size 2
	max_size 2
	step take root
	step chooseleaf firstn 0 type controller
	step emit
}
rule rbd {
	ruleset 3
	type replicated
	min_size 2
	max_size 2
	step take root
	step chooseleaf firstn 0 type controller
	step emit
}

# end crush map
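
One thing that jumped out while comparing the maps: my rules use
chooseleaf against the controller type, while the rules in the default
map (below) use a plain choose against device.  As I understand it,
chooseleaf picks N buckets of the named type and then descends from
each one to a leaf device, so my data rule should behave roughly like
the two-step expansion below (the rule name is just for illustration) -
please correct me if I have the model wrong:

rule data_expanded {
	ruleset 0
	type replicated
	min_size 2
	max_size 2
	step take root
	step choose firstn 0 type controller	# pick N distinct controllers
	step choose firstn 1 type device	# then one device under each
	step emit
}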


And for completeness, here's the default map, retrieved the same way:

# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3

# types
type 0 device
type 1 domain
type 2 pool

# buckets
domain root {
	id -1		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item device0 weight 1.000
	item device1 weight 1.000
	item device2 weight 1.000
	item device3 weight 1.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule casdata {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}
rule rbd {
	ruleset 3
	type replicated
	min_size 1
	max_size 10
	step take root
	step choose firstn 0 type device
	step emit
}

# end crush map
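
In case it's useful, here's roughly how I've been planning to
sanity-check the placements the two maps produce outside of a running
cluster.  I'm assuming a crushtool build that has the compile and test
options, and the file names are just placeholders; the exact flags may
differ in the unstable branch:

	# compile the text map back to binary, then simulate placements
	# for the data rule (ruleset 0) with 2 replicas
	crushtool -c crushmap.txt -o crushmap
	crushtool -i crushmap --test --rule 0 --num-rep 2 --show-mappings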

Here's the ceph.conf I use for both tests.  Note
that for the default map case I just make sure the
crush map file I configured doesn't exist; mkcephfs -v
output suggests that the right thing happens in both
cases.

; global

[global]
	pid file = /var/run/ceph/$name.pid

	; some minimal logging (just message traffic) to aid debugging
	debug ms = 4

; monitor daemon common options
[mon]
	crush map = /mnt/projects/ceph/root/crushmap
	debug mon = 10

; monitor daemon options per instance
; need an odd number of instances
[mon0]
	host = sasa008
	mon addr = 192.168.204.111:6788
	mon data = /mnt/disk/disk.00p1/mon

; mds daemon common options

[mds]
	debug mds = 10

; mds daemon options per instance
[mds0]
	host = sasa008
	mds addr = 192.168.204.111
	keyring = /mnt/disk/disk.00p1/mds/keyring.$name

; osd daemon common options

[osd]
	; osd client message size cap = 67108864
	debug osd = 10

; osd options per instance; i.e. per crushmap device.

[osd0]
	host = sasa008
	osd addr = 192.168.204.111
	keyring     = /mnt/disk/disk.00p1/osd/keyring.$name
	osd journal = /dev/sdb2
	; btrfs devs  = /dev/sdb5
	; btrfs path  = /mnt/disk/disk.00p5
	osd data    = /mnt/disk/disk.00p5
[osd1]
	host = sasa008
	osd addr = 192.168.204.111
	keyring     = /mnt/disk/disk.01p1/osd/keyring.$name
	osd journal = /dev/sdc2
	; btrfs devs  = /dev/sdc5
	; btrfs path  = /mnt/disk/disk.01p5
	osd data    = /mnt/disk/disk.01p5
[osd2]
	host = sasa008
	osd addr = 192.168.204.111
	keyring     = /mnt/disk/disk.02p1/osd/keyring.$name
	osd journal = /dev/sdj2
	; btrfs devs  = /dev/sdj5
	; btrfs path  = /mnt/disk/disk.02p5
	osd data    = /mnt/disk/disk.02p5
[osd3]
	host = sasa008
	osd addr = 192.168.204.111
	keyring     = /mnt/disk/disk.03p1/osd/keyring.$name
	osd journal = /dev/sdk2
	; btrfs devs  = /dev/sdk5
	; btrfs path  = /mnt/disk/disk.03p5
	osd data    = /mnt/disk/disk.03p5
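
If it would help with debugging, I can also bump the debug levels in
the conf above and capture logs around the stall - something like this
in the osd section (just a guess at useful levels):

[osd]
	debug osd = 20
	debug ms = 1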

Maybe I'm still missing something?
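
In the meantime, when the writes stall I'll check whether the PGs
created under the custom map actually go active, along the lines of
(command names from memory, so they may be slightly off):

	ceph -s
	ceph pg dump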

Thanks -- Jim

> 
> sage
> 




