Re: Getting placement groups to place evenly (again)

On 08/04/15 15:16, J David wrote:
> Getting placement groups to be placed evenly continues to be a major
> challenge for us, bordering on impossible.
> 
> When we first reported trouble with this, the ceph cluster had 12
> OSDs (each an Intel DC S3700 400GB) spread across three nodes.  Since
> then, it has grown to 8 nodes with 38 OSDs.
> 
> The average utilization is 80%.  With weights all set to 1, utilization
> varies from 53% to 96%.  Immediately after "ceph osd
> reweight-by-utilization 105" it varies from 61% to 90%.  Essentially,
> once utilization goes over 75%, managing the OSD weights to keep all
> of them under 90% becomes a full-time job.
> 
> This is on 0.80.9 with optimal tunables (including the
> chooseleaf_vary_r=1 and straw_calc_version=1 settings).  The pool has
> 2048 placement groups and has size=2.
> and has size=2.
> 
> What, if anything, can we do about this?  The goals are twofold, and
> in priority order:
> 
> 1) Guarantee that the cluster can survive the loss of a node without
> dying because one "unlucky" OSD overfills.
> 
> 2) Utilize the available space as efficiently as possible.  We are
> targeting 85% utilization, but currently things get ugly pretty
> quickly over 75%.
> 
> Thanks for any advice!

As I understand it, CRUSH is better at being fast and consistent than
at being fair.

We've seen similar levels of variation in the utilisation of OSDs, and
have taken to using 'ceph osd reweight osd.$n $weight' directly.  The
new weight is calculated from the current weight and each OSD's
utilisation relative to the average utilisation across the cluster.

We're using a rather rudimentary calculation at the moment, but it seems
to do the trick.

new_weight = current_weight / (osd_utilisation / average_utilisation)

e.g.:
new_weight = 1.00 / (100 GB / 120 GB)
           = 1.2
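
In script form, that calculation might look something like the below (a
minimal sketch; the function name is just illustrative, and note that,
as far as I know, 'ceph osd reweight' only accepts override weights
between 0 and 1, so the result is clamped):

def calc_new_weight(current_weight, osd_utilisation, average_utilisation):
    """Scale an OSD's override weight towards the cluster average.

    All arguments are floats; utilisations just need to share a unit.
    """
    new_weight = current_weight / (osd_utilisation / average_utilisation)
    # 'ceph osd reweight' only takes values in [0, 1], so clamp
    # anything the formula pushes above 1.0 (as in the example above).
    return min(new_weight, 1.0)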

The new 'ceph osd df --format json' command, available in Hammer
(0.94), dumps out the current weight and disk usage of each OSD, and
could be used as the input for a script which does this.
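
Something along these lines, perhaps (an untested sketch; the 'nodes'
and 'summary' field names are from memory of the Hammer JSON output, so
do check them against your own 'ceph osd df --format json' first):

import json
import subprocess

# Fetch per-OSD usage; assumes the Hammer-era JSON layout with a
# 'nodes' list and a 'summary' section.
raw = subprocess.check_output(['ceph', 'osd', 'df', '--format', 'json'])
df = json.loads(raw)

avg = df['summary']['average_utilization']

for osd in df['nodes']:
    current = osd['reweight']
    util = osd['utilization']
    if current <= 0 or util <= 0:
        continue  # skip out/empty OSDs rather than divide by zero
    new = min(current / (util / avg), 1.0)
    # Print the commands rather than running them, so they can be
    # sanity-checked before any data starts moving.
    print('ceph osd reweight %s %.3f' % (osd['name'], new))

Applying the printed commands a few at a time also keeps the amount of
data in flight at any one moment manageable.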

This does need to be re-done after modifying the cluster (by adding or
removing disks, for example), but it does seem to keep the usage of the
OSDs far closer together.

Initially I looked at balancing PGs across the OSDs rather than data,
but that approach assumes the amount of data in a pool corresponds
closely to the number of PGs it has.

In any case, I'd recommend more capacity.  At that level of utilisation
a node failure or two could very quickly cause OSDs to fill even if
they're well balanced.


-- 
David Clarke
Systems Architect
Catalyst IT
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



