Uneven OSD usage

Hello,

Is there any way to nudge a Ceph cluster into leveling out its OSD usage?

Currently, a cluster of 3 servers with 4 identical OSDs each is
showing a disparity of about 20% in stored data between the most-used
OSD and the least-used OSD.  That wouldn't be too big a problem, but
the most-used OSD is now at 86% full (with the least-used at 72%).

There are three more nodes on order, but they are a couple of weeks
away.  Is there anything I can do in the meantime to push existing
data (and new data) toward the less-used OSDs?

Reweighting the OSDs feels intuitively like the wrong approach, since
they are all the same size and "should" have the same weight.  Is that
the wrong intuition?
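
If that intuition is wrong, I assume the fix would be something along
these lines, i.e. lowering the 0-1 override reweight (not the CRUSH
weight) on one of the fullest OSDs and letting data migrate off it
(osd.9 and the 0.9 value here are just examples I picked):

ceph osd reweight 9 0.9

and then repeating on the next-fullest OSD as needed?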

Also, on a test cluster, I did try playing around with
reweight-by-utilization, and it actually seemed to make things worse.
But that cluster was assembled from spare parts, and its OSDs were
neither all the same size nor uniformly distributed between servers.
The cluster described here is *not* a test cluster, so I am gun-shy
about possibly making things worse.

Is reweight-by-utilization the right point to poke this?  Or is there
a better tool in the toolbox for this situation?
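
If reweight-by-utilization is the right tool, I assume the invocation
on this cluster would look something like the following, where any OSD
above 110% of the average utilization gets its reweight value lowered
(the 110 threshold is just an example, not something I've settled on):

ceph osd reweight-by-utilization 110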

Here is the OSD tree showing that everything is weighted equally:

# id    weight  type name          up/down  reweight
-1      4.2     root default
-2      1.4       host f13
 0      0.35        osd.0          up       1
 1      0.35        osd.1          up       1
 2      0.35        osd.2          up       1
 3      0.35        osd.3          up       1
-3      1.4       host f14
 4      0.35        osd.4          up       1
 9      0.35        osd.9          up       1
10      0.35        osd.10         up       1
11      0.35        osd.11         up       1
-4      1.4       host f15
 5      0.35        osd.5          up       1
 6      0.35        osd.6          up       1
 7      0.35        osd.7          up       1
 8      0.35        osd.8          up       1

And the df output for each node (size, used, avail, use%):

Node 1:

/dev/sda2    358G  258G  101G  72%  /var/lib/ceph/osd/ceph-0
/dev/sdb2    358G  294G   65G  82%  /var/lib/ceph/osd/ceph-1
/dev/sdc2    358G  278G   81G  78%  /var/lib/ceph/osd/ceph-2
/dev/sdd2    358G  294G   65G  83%  /var/lib/ceph/osd/ceph-3

Node 2:

/dev/sda2    358G  285G   73G  80%  /var/lib/ceph/osd/ceph-5
/dev/sdb2    358G  305G   53G  86%  /var/lib/ceph/osd/ceph-6
/dev/sdc2    358G  301G   58G  85%  /var/lib/ceph/osd/ceph-7
/dev/sdd2    358G  299G   60G  84%  /var/lib/ceph/osd/ceph-8

Node 3:

/dev/sda2    358G  290G   68G  82%  /var/lib/ceph/osd/ceph-4
/dev/sdb2    358G  297G   62G  83%  /var/lib/ceph/osd/ceph-11
/dev/sdc2    358G  285G   73G  80%  /var/lib/ceph/osd/ceph-10
/dev/sdd2    358G  306G   53G  86%  /var/lib/ceph/osd/ceph-9

Ideally, we would like to get about 125 GB more data (with the number
of replicas set to 2) onto this pool before the additional nodes
arrive, which would put *everything* at about 86-87% if the data were
evenly balanced.  But the way it's currently going, that'll have the
busiest OSD dangerously close to 95%.  (Apparently data increases
faster than you expect, even if you account for this. :-P )
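
For reference, the back-of-the-envelope math behind that, using the df
numbers above (treating each OSD as 358 GB and ignoring filesystem
overhead):

12 OSDs x 358 GB                ~= 4296 GB raw capacity
currently used (sum of "used")  ~= 3492 GB (~81% average)
125 GB x 2 replicas              =  250 GB additional raw data
(3492 + 250) / 4296             ~= 87% average if evenly balanced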

What's the best way forward?

Thanks for any advice!

