Re: crush chooseleaf vs. choose

Sage Weil <sage@xxxxxxxxxxx> · Mon, 6 Jan 2014 03:50:03 -0800 (PST)

On Mon, 6 Jan 2014, Dietmar Maurer wrote:
> > 'ceph osd crush tunables optimal'
> > 
> > or adjust an offline map file via the crushtool command line (more
> > annoying) and retest; I suspect that is the problem.
> > 
> > http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> 
> That solves the bug with weight 0, thanks.
> 
> But I still get the following distribution:
> 
>   device 0:     423
>   device 1:     453
>   device 2:     430
>   device 3:     455
>   device 4:     657
>   device 5:     654
> 
> Hosts with only one OSD get too much data.

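I think this is just fundamentally a problem with distributing 3 replicas
over only 4 hosts.  With the replicas on separate hosts, each placement
leaves out exactly one of the 4 hosts, so every piece of data in the system
has to land on prox-ceph-3 or prox-ceph-4 (or both), and thus on device 4
or 5.  Those two single-OSD hosts therefore carry more per device than
their weight alone would suggest.  Add more hosts or disks and the
distribution will even out.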
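
The argument above can be checked with a small model.  What follows is a
rough sketch, not CRUSH itself: it assumes hosts are drawn one at a time,
without replacement, with probability proportional to weight, and that a
host's share is split evenly across its equally weighted OSDs.  The
function names and the figure of 1024 inputs are assumptions of mine (the
1024 is inferred from the reported counts summing to 3072 = 3 x 1024);
the weights come from the crush map quoted below.

```python
from fractions import Fraction
from itertools import permutations

# (host weight, number of OSDs), copied from the quoted crush map.
HOSTS = {
    "prox-ceph-1": (Fraction("7.260"), 2),
    "prox-ceph-2": (Fraction("7.260"), 2),
    "prox-ceph-3": (Fraction("3.630"), 1),
    "prox-ceph-4": (Fraction("3.630"), 1),
}
REPLICAS = 3
INPUTS = 1024  # assumption: inferred from 3072 reported placements / 3


def host_inclusion(weights, replicas):
    """Probability that each host is among `replicas` hosts drawn one at a
    time, without replacement, proportionally to weight (a simplified
    stand-in for CRUSH's retry-on-collision behaviour)."""
    inclusion = {host: Fraction(0) for host in weights}
    for order in permutations(weights, replicas):
        remaining = sum(weights.values())
        prob = Fraction(1)
        for host in order:
            prob *= weights[host] / remaining
            remaining -= weights[host]
        for host in order:
            inclusion[host] += prob
    return inclusion


def expected_per_osd(hosts, replicas, inputs):
    """Expected placements per OSD, assuming an even split inside a host."""
    weights = {host: weight for host, (weight, _) in hosts.items()}
    inclusion = host_inclusion(weights, replicas)
    return {
        host: float(inclusion[host] * inputs / hosts[host][1])
        for host in hosts
    }


if __name__ == "__main__":
    for host, count in expected_per_osd(HOSTS, REPLICAS, INPUTS).items():
        print(f"{host}: about {count:.0f} placements per OSD")
```

Under these assumptions the model gives about 444 placements per OSD on
the two-OSD hosts and about 649 on the single-OSD hosts, which is close
to the 423-455 and 654-657 reported by crushtool.  That suggests the skew
follows from the 3-of-4 host constraint rather than from a mapping bug.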

sage

> 
> > On Fri, 3 Jan 2014, Dietmar Maurer wrote:
> > 
> > > > In both cases, you only get 2 replicas on the remaining 2 hosts.
> > >
> > > OK, I was able to reproduce this with crushtool.
> > >
> > > > The difference is if you have 4 hosts with 2 osds.  In the choose
> > > > case, you have some fraction of the data that chose the down host in
> > > > the first step (most of the attempts, actually!) and then couldn't
> > > > find a usable osd, leaving you with only 2
> > >
> > > This is also reproducible.
> > >
> > > > replicas.  With chooseleaf that doesn't happen.
> > > >
> > > > The other difference is if you have one of the two OSDs on the host
> > > > marked out.
> > > > In the choose case, the remaining OSD will get allocated 2x the
> > > > data; in the chooseleaf case, usage will remain proportional with
> > > > the rest of the cluster and the data from the out OSD will be
> > > > distributed across other OSDs (at least when there are > 3 hosts!).
> > >
> > > I see, but data distribution does not seem optimal in that case.
> > >
> > > For example using this crush map:
> > >
> > > # types
> > > type 0 osd
> > > type 1 host
> > > type 2 rack
> > > type 3 row
> > > type 4 room
> > > type 5 datacenter
> > > type 6 root
> > >
> > > # buckets
> > > host prox-ceph-1 {
> > > 	id -2		# do not change unnecessarily
> > > 	# weight 7.260
> > > 	alg straw
> > > 	hash 0	# rjenkins1
> > > 	item osd.0 weight 3.630
> > > 	item osd.1 weight 3.630
> > > }
> > > host prox-ceph-2 {
> > > 	id -3		# do not change unnecessarily
> > > 	# weight 7.260
> > > 	alg straw
> > > 	hash 0	# rjenkins1
> > > 	item osd.2 weight 3.630
> > > 	item osd.3 weight 3.630
> > > }
> > > host prox-ceph-3 {
> > > 	id -4		# do not change unnecessarily
> > > 	# weight 3.630
> > > 	alg straw
> > > 	hash 0	# rjenkins1
> > > 	item osd.4 weight 3.630
> > > }
> > >
> > > host prox-ceph-4 {
> > > 	id -5		# do not change unnecessarily
> > > 	# weight 3.630
> > > 	alg straw
> > > 	hash 0	# rjenkins1
> > > 	item osd.5 weight 3.630
> > > }
> > >
> > > root default {
> > > 	id -1		# do not change unnecessarily
> > > 	# weight 21.780
> > > 	alg straw
> > > 	hash 0	# rjenkins1
> > > 	item prox-ceph-1 weight 7.260   # 2 OSDs
> > > 	item prox-ceph-2 weight 7.260   # 2 OSDs
> > > 	item prox-ceph-3 weight 3.630   # 1 OSD
> > > 	item prox-ceph-4 weight 3.630   # 1 OSD
> > > }
> > >
> > > # rules
> > > rule data {
> > > 	ruleset 0
> > > 	type replicated
> > > 	min_size 1
> > > 	max_size 10
> > > 	step take default
> > > 	step chooseleaf firstn 0 type host
> > > 	step emit
> > > }
> > > # end crush map
> > >
> > > crushtool shows the following utilization:
> > >
> > > # crushtool --test -i my.map --rule 0 --num-rep 3 --show-utilization
> > >   device 0:	423
> > >   device 1:	452
> > >   device 2:	429
> > >   device 3:	452
> > >   device 4:	661
> > >   device 5:	655
> > >
> > > Any explanation for that?  Maybe related to the small number of devices?
> > >
> > >
> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com