Re: Very unbalanced osd data placement with differing sized devices

On Thu, 17 Oct 2013, Mark Kirkwood wrote:
> I stumbled across this today:
> 
> 4 osds on 4 hosts (names ceph1 -> ceph4). They are KVM guests (this is a play
> setup).
> 
> - ceph1 and ceph2 each have a 5G volume for osd data (+ 2G vol for journal)
> - ceph3 and ceph4 each have a 10G volume for osd data (+ 2G vol for journal)
> 
> I do a standard installation of ceph (0.67.4) via ceph-deploy (1.2.7) on each
> host [1]. The topology looks like:
> 
> $ ceph osd tree
> # id    weight    type name    up/down    reweight
> -1    0.01999    root default
> -2    0        host ceph1
> 0    0            osd.0    up    1
> -3    0        host ceph2
> 1    0            osd.1    up    1
> -4    0.009995        host ceph3
> 2    0.009995            osd.2    up    1
> -5    0.009995        host ceph4
> 3    0.009995            osd.3    up    1
> 
> So osd.0 and osd.1 (on ceph1,2) have weight 0, and osd.2 and osd.3 (on ceph3,4)
> have weight 0.009995, which suggests that data will flee osd.0,1 and live only
> on osd.2,3. Sure enough, putting in a few objects via rados put results in:
> 
> ceph1 $ df -m
> Filesystem     1M-blocks  Used Available Use% Mounted on
> /dev/vda1           5038  2508      2275  53% /
> udev                 994     1       994   1% /dev
> tmpfs                401     1       401   1% /run
> none                   5     0         5   0% /run/lock
> none                1002     0      1002   0% /run/shm
> /dev/vdb1           5109    40      5070   1% /var/lib/ceph/osd/ceph-0
> 
> (similarly for ceph2), whereas:
> 
> ceph3 $ df -m
> Filesystem     1M-blocks  Used Available Use% Mounted on
> /dev/vda1           5038  2405      2377  51% /
> udev                 994     1       994   1% /dev
> tmpfs                401     1       401   1% /run
> none                   5     0         5   0% /run/lock
> none                1002     0      1002   0% /run/shm
> /dev/vdb1          10229  1315      8915  13% /var/lib/ceph/osd/ceph-2
> 
> (similarly for ceph4). Obviously I can fix this by reweighting the first
> two osds to something like 0.005, but I'm wondering if there is something I've
> missed - clearly some kind of auto weighting has been performed on the basis
> of the size difference in the data volumes, but it looks to be skewing data
> far too much towards the bigger ones. Is there perhaps a bug in the smarts for
> this? Or is it just because I'm using small volumes (5G = 0 weight)?

Yeah, I think this is just rounding error.  By default a weight of 1.0 == 
1 TB, so these are just very small numbers.  Internally, we're storing it 
as a fixed-point 32-bit value where 1.0 == 0x10000, and 5 GB is just too 
small for those units.
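
To put rough numbers on it (assuming the startup script derives the weight 
from the device size in TB and rounds it to two decimal places before 
handing it to CRUSH; that rounding step is my assumption, not something 
confirmed here):

 10 GB -> 0.01 TB;  0.01 * 0x10000 ~= 655;  655 / 0x10000 ~= 0.009995
  5 GB -> 0.005 TB; rounded to two decimal places that is 0.00 -> weight 0

which matches the weights shown in your osd tree above.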

You can disable this autoweighting with 

 osd crush update on start = false

in ceph.conf.
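
For example, a minimal ceph.conf fragment (putting it under [osd] is just 
one reasonable placement; [global] works too):

 [osd]
 osd crush update on start = false

With that set the OSDs leave their CRUSH weights alone on startup, so you 
would set the weights yourself, e.g. along the lines of what you suggested:

 ceph osd crush reweight osd.0 0.005
 ceph osd crush reweight osd.1 0.005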

sage


> 
> Cheers
> 
> Mark
> 
> [1] i.e:
> 
> $ ceph-deploy new ceph1
> $ ceph-deploy mon create ceph1
> $ ceph-deploy gatherkeys ceph1
> $ ceph-deploy osd create ceph1:/dev/vdb:/dev/vdc
> ...
> $ ceph-deploy osd create ceph4:/dev/vdb:/dev/vdc
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com