Re: Different disk usage on different OSDs

The short answer is that uniform distribution is a lower-priority feature of the CRUSH hashing algorithm.

CRUSH is designed to be consistent and stable in its hashing.  For the details, you can read Sage's paper (http://ceph.com/papers/weil-rados-pdsw07.pdf).  The goal is that if you make a change to your cluster, there will be some moderate data movement, but not everything will move.  If you then undo the change, things will go back to exactly how they were before.

Doing that while also getting uniform distribution is hard, and it's a work in progress.  The tunables are progress on this front, but they are by no means the last word.
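If you want to see where your cluster stands on that front, something like this should work (exact profile names depend on your release, so treat it as a sketch):

    # show the tunables currently in effect
    ceph osd crush show-tunables

    # move to the latest recommended profile (this will trigger data movement)
    ceph osd crush tunables optimal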


The current workaround is to use ceph osd reweight-by-utilization.  That tool looks at the data distribution and reweights OSDs to bring them more in line with each other.  Unfortunately, it does a ceph osd reweight, not a ceph osd crush reweight.  (The existence of two different weights with different behavior is unfortunate too.)  ceph osd reweight is temporary, in that the value will be lost if an OSD is marked out.  ceph osd crush reweight updates the CRUSH map, and it's not temporary.  So I use ceph osd crush reweight manually.
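For reference, the two commands look like this (the OSD id and weights here are made up purely for illustration):

    # temporary override weight between 0.0 and 1.0; lost if the OSD is marked out
    ceph osd reweight 42 0.9

    # persistent CRUSH weight, stored in the CRUSH map
    ceph osd crush reweight osd.42 3.5

    # or let Ceph pick override weights from current utilization
    ceph osd reweight-by-utilization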

While it would be nice if Ceph rebalanced itself automatically, I'd turn that off anyway.  Moving data around in my small cluster causes a major performance hit.  By manually adjusting the CRUSH weights, I have some control over when and how much data is moved.


I recommend taking a look at ceph osd tree and df on all nodes, then adjusting the CRUSH weight of heavily used disks down and underutilized disks up.  The CRUSH weight is generally the size (base 2) of the disk in TiB.  I adjust my OSDs up or down by 0.05 each step, then decide if I need to make another pass.  I have one 4 TiB drive with a weight of 4.14, and another with a weight of 3.04.  They're still not balanced, but it's better.
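In practice the loop looks roughly like this (osd.30 and osd.87 are taken from your gist, but the weights are just illustrative; use whatever ceph osd tree actually shows):

    # see current CRUSH weights and which OSDs are heavy
    ceph osd tree
    # check actual disk usage on each node
    df -h

    # osd.30 is over-full: step its CRUSH weight down by 0.05
    ceph osd crush reweight osd.30 3.59

    # osd.87 is under-full: step it up by 0.05
    ceph osd crush reweight osd.87 3.69

    # wait for backfill to finish, re-check, and repeat if needed
    ceph -s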


If data migration has a smaller impact on your cluster, larger steps should be fine.  And if anything causes major problems, just revert the change.  CRUSH is stable and consistent :-)




On Mon, Jan 5, 2015 at 2:04 AM, ivan babrou <ibobrik@xxxxxxxxx> wrote:
Hi!

I have a cluster with 106 OSDs, and disk usage varies from 166 GB to 316 GB.  Disk usage is highly correlated with the number of PGs per OSD (no surprise here).  Is there a reason for Ceph to allocate more PGs on some nodes?

The biggest OSDs are 30, 42 and 69 (300 GB+ each) and the smallest are 87, 33 and 55 (170 GB each).  The biggest pool has 2048 PGs; pools with very little data have only 8 PGs.  PG size in the biggest pool is ~6 GB (5.1..6.3 actually).

The lack of balanced disk usage prevents me from using all the disk space.  When the biggest OSD is full, the cluster does not accept writes anymore.

Here's gist with info about my cluster: https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


