Re: Rebalancing

David Turner <drakonstein@xxxxxxxxx> · Tue, 25 Apr 2017 17:29:31 +0000

Where my method of generating balanced CRUSH maps really shines is when you add, remove, or replace storage.  Adding new storage nodes to a full cluster can often leave you stuck with osd_nearfull, unable to finish the backfill.  I generate a new map before I start backfilling onto the new hardware, and the cluster backfills to a balanced state on its first go.  The same goes for removing or replacing storage (re: migrating to bluestore).  You can weight as many nodes as you can spare in your cluster to 0.0 and the cluster will backfill off of them; then you remove them and re-add them as bluestore while weighting the next set of nodes to 0.0.  The cluster remains in a balanced and operable state the entire time, running close to 75-80% full.  The best part is that you will not have degraded objects at any point.
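
As a rough sanity check on how many nodes can be spared at once, the arithmetic can be sketched as below.  This is a minimal sketch and not part of the map-generation method itself: the function names, the node names and the 80% ceiling are illustrative assumptions, and it assumes the data ends up spread evenly over the remaining raw capacity.

```python
def utilization_after_drain(used_tb, node_capacities_tb, drained_nodes):
    """Cluster-wide fullness once the drained nodes hold no data.

    Assumes the data spreads evenly over the remaining capacity;
    a real cluster will deviate from this.
    """
    remaining = sum(cap for name, cap in node_capacities_tb.items()
                    if name not in drained_nodes)
    if remaining <= 0:
        raise ValueError("cannot drain every node")
    return used_tb / remaining


def max_drainable(used_tb, node_capacities_tb, ceiling=0.80):
    """Pick nodes (smallest first) to weight to 0.0 while staying under the ceiling."""
    names = sorted(node_capacities_tb, key=node_capacities_tb.get)
    drained = []
    for name in names:
        trial = drained + [name]
        if len(trial) == len(names):
            break  # never drain the whole cluster
        if utilization_after_drain(used_tb, node_capacities_tb, trial) > ceiling:
            break
        drained = trial
    return drained
```

For example, with ten hypothetical 100 TB nodes holding 600 TB in total, two nodes can be drained at a time (75% full on the remaining eight), while a third would push the rest past 80%.
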
Other than Sage mentioning it here in this thread, I haven't heard about how Luminous will be managing data distribution differently.  It sounds amazing and makes me very excited, but I know nothing about it.

On Tue, Apr 25, 2017 at 12:31 PM Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
I've played with mixing the crush reweight and OSD reweight before, but I've found it just complicates things.  I've been pretty successful on my home cluster using only OSD reweights, and that's with OSDs ranging in size from 500GB to 6TB.

I wrote a tool that takes the output of 'ceph osd df -f json' and 'ceph pg dump -f json' and uses the size of each PG to determine how much data each OSD is using now (the 'acting' set) and how much each OSD will use after rebalancing is complete (the 'up' set).  I can then adjust the OSD reweights and see what the effect will be after rebalancing is completed.
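
A minimal sketch of the per-OSD accounting such a tool might do; it is not the actual tool.  It assumes the PG entries from 'ceph pg dump -f json' have already been loaded into a list of dicts carrying 'up', 'acting' and 'stat_sum'/'num_bytes' fields (the exact JSON layout varies between releases), and that the pools are replicated, so every OSD in a set holds a full copy of the PG.

```python
from collections import defaultdict


def bytes_per_osd(pg_stats, osd_set):
    """Sum PG sizes onto every OSD listed in the chosen set.

    pg_stats: list of per-PG dicts (assumed field names, see above).
    osd_set:  "up" or "acting", i.e. which OSD list of each PG to use.
    """
    totals = defaultdict(int)
    for pg in pg_stats:
        size = pg["stat_sum"]["num_bytes"]
        for osd in pg[osd_set]:
            totals[osd] += size
    return dict(totals)


# Tiny made-up sample: PG 1.0 has different 'up' and 'acting' sets.
sample = [
    {"pgid": "1.0", "up": [0, 1], "acting": [1, 2],
     "stat_sum": {"num_bytes": 100}},
    {"pgid": "1.1", "up": [1, 2], "acting": [1, 2],
     "stat_sum": {"num_bytes": 50}},
]
by_up = bytes_per_osd(sample, "up")
by_acting = bytes_per_osd(sample, "acting")
```

Comparing the two totals per OSD shows how much data each OSD gains or loses between the two sets.
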

However, I've been meaning to write a new tool based on what I learned reading through David Turner's reweighting script (http://lists.ceph.com/pipermail/ceph-large-ceph.com/2017-February/000040.html).  Using osdmaptool and crushtool you can test the effect of your OSD reweights offline and see what the result will be.  I did this successfully at my last job to determine the effect of changing the straw version like this:

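# Grab the current OSD map and extract its CRUSH map.
ceph osd getmap -o osd.map
osdmaptool osd.map --export-crush crush-straw0.map

# Build a second CRUSH map with straw_calc_version 1, then recalculate its bucket weights.
crushtool -i crush-straw0.map --set-straw-calc-version 1 -o crush-straw1.map
crushtool -i crush-straw1.map --reweight -o crush-straw1.map

# Dump the PG-to-OSD mappings that each CRUSH map produces.
osdmaptool osd.map --import-crush crush-straw0.map --test-map-pgs-dump >osd-mappings-straw0.txt
osdmaptool osd.map --import-crush crush-straw1.map --test-map-pgs-dump >osd-mappings-straw1.txt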

That would make it much quicker to loop through all the PGs, reweight the fullest OSD down, check the result, and repeat until the cluster is balanced.
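
The "checking the result" step could be sketched roughly as follows, by comparing two mapping dumps offline.  This is a sketch under an assumption about the dump format: it expects each PG line to look like "<pgid><TAB>[osd,osd,...]<TAB><primary>" and skips every other line; check that against the output of your own osdmaptool version.  The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed PG line shape: "1.2f<TAB>[3,7,12]<TAB>3"; anything else is skipped.
PG_LINE = re.compile(r"^(\d+\.[0-9a-f]+)\s+\[([0-9,]*)\]")


def parse_mappings(text):
    """Return {pgid: [osd ids]} for every PG line found in a dump."""
    mappings = {}
    for line in text.splitlines():
        match = PG_LINE.match(line.strip())
        if match:
            osds = [int(x) for x in match.group(2).split(",") if x]
            mappings[match.group(1)] = osds
    return mappings


def pgs_per_osd(mappings):
    """Count how many PGs land on each OSD."""
    counts = Counter()
    for osds in mappings.values():
        counts.update(osds)
    return counts


def moved_pgs(before, after):
    """List the PGs whose OSD set differs between the two dumps."""
    return sorted(pg for pg in before if before[pg] != after.get(pg))
```

Feeding the contents of osd-mappings-straw0.txt and osd-mappings-straw1.txt through parse_mappings() and then moved_pgs() gives the PGs that would move; pgs_per_osd() gives a rough per-OSD balance picture that ignores PG size.
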

I'm looking forward to the day when I can have Luminous in production to handle this automatically with ceph-mgr!

Bryan

On 4/25/17, 12:22 AM, "Ceph-large on behalf of Anthony D'Atri" <ceph-large-bounces@xxxxxxxxxxxxxx on behalf of aad@xxxxxxxxxxxxxx> wrote:

I read this thread with interest because I’ve been squeezing the OSD distribution on several clusters myself while expansion gear is in the pipeline, ending up with an ugly mix of both types of reweight as well as temporarily raised full and backfill full ratios.

I’d been contemplating tweaking Dan@CERN’s reweighting script to use CRUSH reweighting instead, and to squeeze from both ends, though I fear it might not be as simple as it sounds prima facie.

Aaron wrote:

> Should I be expecting it to decide to increase some underutilized osds?

The osd reweight mechanism only accommodates an override weight between 0 and 1, thus it can decrease but not increase a given OSD’s fullness.  To directly fill up underfull OSDs it would seem to need an override weight > 1, which isn’t possible.
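
The effect of that ceiling can be illustrated with a small sketch.  The proportional rule below is a hypothetical one chosen for illustration and is not necessarily what any existing reweighting script uses; only the 0-to-1 clamp reflects the constraint described above.

```python
def proposed_override(current, utilization, mean_utilization):
    """Scale an override weight toward the mean, clamped to [0, 1].

    current:          the OSD's present override weight (0..1)
    utilization:      the OSD's fullness as a fraction (e.g. 0.90)
    mean_utilization: average fullness across the OSDs considered
    """
    if utilization <= 0:
        return 1.0  # an empty OSD cannot be weighted any higher than 1
    scaled = current * mean_utilization / utilization
    return max(0.0, min(1.0, scaled))
```

An OSD at 90% full with a 75% mean comes out at about 0.83, but an OSD at 60% full that already has an override of 1.0 stays at 1.0: the rule would like 1.25, and the clamp discards the excess.
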

I haven’t personally experienced it (yet), but from what I read, if override-reweighted OSDs get marked out and back in again, their override will revert to 1.  In a case where a cluster is running close to the full ratio, it would *seem* that a network glitch etc. might result in some OSDs filling up and hitting the full threshold, which would be bad.

Using CRUSH reweight instead would seem to address both of these shortcomings, though it does perturb the arbitrary but useful way that initial CRUSH weights by default reflect the capacity of each OSD.  Various references also indicate that the override reweight does not change the weight of buckets above the OSD, but that CRUSH reweight does.  I haven’t found any discussion of the ramifications of this, but my initial stab at it would be that when one does the 0-1 override reweight, the “extra” data is redistributed to OSDs on the same node.  CRUSH reweighting would then seem to pull / push the wad of data being adjusted from / to *other* OSD nodes.  Or it could be that I’m out of my Vulcan mind.

Thus adjusting the weight of a given OSD affects the fullness of other OSDs, in ways that would seem to differ depending on which method is used.  As I think you implied in one of your messages, sometimes this can result in the fullness of one or more OSDs climbing relatively sharply, even to a point distinctly above where the previous most-full OSDs were.

I lurked in the recent developers’ meeting where strategies for A Better Way in Luminous were discussed.  While the plans are exciting and hold promise for uniform and thus greater safe utilization of a cluster’s raw space, I suspect that between dev/test time and the attrition needed to update running clients, those of us running existing RBD clusters won’t be able to take advantage of them for some time.

— Anthony

_______________________________________________

Ceph-large mailing list

Ceph-large@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
