I may be wrong, but I always thought that a weight of 0 means "don't put anything there"; all weights > 0 are considered proportionally. See http://ceph.com/docs/master/rados/operations/crush-map/ which recommends weighting by capacity anyway:

    Weighting Bucket Items

    Ceph expresses bucket weights as double integers, which allows for fine
    weighting. A weight is the relative difference between device capacities.
    We recommend using 1.00 as the relative weight for a 1TB storage device.
    In such a scenario, a weight of 0.5 would represent approximately 500GB,
    and a weight of 3.00 would represent approximately 3TB. Higher level
    buckets have a weight that is the sum total of the leaf items aggregated
    by the bucket.

    A bucket item weight is one dimensional, but you may also calculate your
    item weights to reflect the performance of the storage drive. For
    example, if you have many 1TB drives where some have a relatively low
    data transfer rate and others have a relatively high data transfer rate,
    you may weight them differently, even though they have the same capacity
    (e.g., a weight of 0.80 for the first set of drives with lower total
    throughput, and 1.20 for the second set of drives with higher total
    throughput).

David Zafman
Senior Developer
http://www.inktank.com

On Oct 16, 2013, at 8:15 PM, Mark Kirkwood <mark.kirkwood@xxxxxxxxxxxxxxx> wrote:

> I stumbled across this today:
>
> 4 osds on 4 hosts (names ceph1 -> ceph4). They are KVM guests (this is a play setup).
>
> - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
> - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)
>
> I do a standard installation via ceph-deploy (1.2.7) of ceph (0.67.4) on each one [1].
> The topology looks like:
>
> $ ceph osd tree
> # id    weight      type name       up/down reweight
> -1      0.01999     root default
> -2      0               host ceph1
> 0       0                   osd.0   up      1
> -3      0               host ceph2
> 1       0                   osd.1   up      1
> -4      0.009995        host ceph3
> 2       0.009995            osd.2   up      1
> -5      0.009995        host ceph4
> 3       0.009995            osd.3   up      1
>
> So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd.2 and osd.3 (on ceph3,4) have weight 0.009995. This suggests that data will flee osd.0,1 and live only on osd.2,3. Sure enough, putting in a few objects via rados put results in:
>
> ceph1 $ df -m
> Filesystem     1M-blocks  Used Available Use% Mounted on
> /dev/vda1           5038  2508      2275  53% /
> udev                 994     1       994   1% /dev
> tmpfs                401     1       401   1% /run
> none                   5     0         5   0% /run/lock
> none                1002     0      1002   0% /run/shm
> /dev/vdb1           5109    40      5070   1% /var/lib/ceph/osd/ceph-0
>
> (similarly for ceph2), whereas:
>
> ceph3 $ df -m
> Filesystem     1M-blocks  Used Available Use% Mounted on
> /dev/vda1           5038  2405      2377  51% /
> udev                 994     1       994   1% /dev
> tmpfs                401     1       401   1% /run
> none                   5     0         5   0% /run/lock
> none                1002     0      1002   0% /run/shm
> /dev/vdb1          10229  1315      8915  13% /var/lib/ceph/osd/ceph-2
>
> (similarly for ceph4). Obviously I can fix this by reweighting the first two osds to something like 0.005, but I'm wondering if there is something I've missed - clearly some kind of auto weighting has been performed on the basis of the size difference of the data volumes, but it looks to be skewing data far too much toward the bigger ones. Is there perhaps a bug in the smarts for this? Or is it just because I'm using small volumes (5G = 0 weight)?
>
> Cheers
>
> Mark
>
> [1] i.e:
>
> $ ceph-deploy new ceph1
> $ ceph-deploy mon create ceph1
> $ ceph-deploy gatherkeys ceph1
> $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
> ...
> $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
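[Editor's note] The arithmetic behind Mark's "5G = 0 weight" guess can be sketched. ceph-deploy/ceph-disk assigns an initial weight of roughly the data volume's size in TB, and CRUSH stores weights internally as 16.16 fixed point, which is why a nominal 0.01 shows up as 0.009995 in `ceph osd tree`. The two-decimal rounding below is a hypothetical reconstruction, not the actual ceph-disk code:

```python
# Sketch: why a 5G OSD volume can end up with CRUSH weight 0 while a 10G
# volume gets 0.009995.
# Assumption (not the actual ceph-disk source): the initial weight is the
# data volume's size in TiB, rounded to two decimal places.

GIB = 1024 ** 3
TIB = 1024 ** 4

def initial_weight(size_bytes):
    """Hypothetical ceph-disk-style weight: size in TiB, two decimals."""
    return round(size_bytes / TIB, 2)

def crush_stored_weight(weight):
    """CRUSH stores weights as 16.16 fixed point (weight * 0x10000, truncated)."""
    return int(weight * 0x10000) / 0x10000

print(initial_weight(5 * GIB))       # 5G volume  -> 0.0
print(initial_weight(10 * GIB))      # 10G volume -> 0.01
print(crush_stored_weight(0.01))     # -> 0.00999450..., displayed as 0.009995
```

If that is right, the fix Mark mentions follows directly: `ceph osd crush reweight osd.0 0.005` (and likewise osd.1) gives the 5G volumes roughly half the weight of the 10G ones, instead of zero.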