Re: Rebalancing

On Thu, 16 Nov 2017, Rafał Wądołowski wrote:
> Sage,
> 
> you wrote about an 'automatic balancer module'. What do you mean? Could you
> tell us more, or paste some hyperlinks?

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/balancer/module.py

There will be a blog post as soon as 12.2.2 is out.  Basically you can do

 ceph balancer mode crush-compat
 ceph balancer on

and walk away.
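
You can also sanity-check what it is doing with

 ceph balancer status
 ceph balancer eval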

sage


> 
> BR,
> 
> Rafał Wądołowski
> 
> 
> http://cloudferro.com/
> On 16.11.2017 14:08, Sage Weil wrote:
> > On Thu, 16 Nov 2017, Pavan Rallabhandi wrote:
> > > Had to revive this old thread, had couple of questions.
> > > 
> > > Since `ceph osd reweight-by-utilization` changes the weights of the
> > > OSDs but not the CRUSH weights, isn't it still a problem (since the OSD
> > > weights would be reset to 1) if those reweighted OSDs go OUT of the
> > > cluster and later get marked IN?
> > > 
> > > I thought since OSD weights are not persistent across OUT/IN cycles, it
> > > is not a lasting solution to use `ceph osd reweight` or `ceph osd
> > > reweight-by-utilization`.
> > This was fixed a while ago, and a superficial check of the jewel code
> > indicates that the in/out values are persistent now.  Have you observed
> > them getting reset with jewel?
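> > 
> > (One quick way to check after an out/in cycle is to compare the per-OSD
> > weights in the osdmap before and after:
> > 
> >  ceph osd dump | grep ^osd
> > 
> > and see whether the override weights survived.)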
> > 
> > > We are having balancing issues on one of our Jewel clusters and I wanted
> > > to understand the pros of using `ceph osd reweight-by-utilization` over
> > > `ceph osd crush reweight`.
> > Both will get the job done, but I would stick with reweight-by-utilization
> > as it keeps the real CRUSH weight matched to the device size.  Once you
> > move to luminous, it will be an easier transition to the automatic
> > balancer module (which handles all of this for you).
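> > 
> > For reference, the two look roughly like this (OSD id and numbers made up):
> > 
> >  ceph osd reweight-by-utilization 110    # only touch OSDs >110% of mean
> >  ceph osd crush reweight osd.12 1.81929  # set the CRUSH weight directly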
> > 
> > sage
> > 
> > 
> > > Thanks,
> > > -Pavan.
> > > 
> > > From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx> on behalf of Dan Van
> > > Der Ster <daniel.vanderster@xxxxxxx>
> > > Date: Tuesday, 25 April 2017 at 11:36 PM
> > > To: Anthony D'Atri <aad@xxxxxxxxxxxxxx>
> > > Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>
> > > Subject: EXT: Re:  Rebalancing
> > > 
> > > We run this continuously -- in a cron every 2 hours -- on all of our
> > > clusters:
> > > https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
> > > It's a misnomer, yes -- my original plan was indeed to modify CRUSH
> > > weights, but for some reason I no longer recall, I switched it to
> > > modifying the reweights. It should be super easy to change the crush
> > > weight instead.
> > > We run it with params to change weights of only 4 OSDs by 0.01 at a time.
> > > This ever so gradually flattens the PG distribution, and is totally
> > > transparent latency-wise.
> > > BTW, it supports reweighting only below certain CRUSH buckets, which is
> > > essential if you have a non-uniform OSD tree.
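> > > 
> > > In spirit, each pass boils down to something like this (a sketch, not the
> > > script itself; id and weight made up):
> > > 
> > >  ceph osd df                  # find the fullest OSDs by %USE
> > >  ceph osd reweight 42 0.99    # nudge one down by 0.01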
> > > 
> > > For adding in new hardware, we use this script:
> > > https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
> > > New OSDs start with crush weight 0, then we gradually increase the weights
> > > 0.01 at a time, all the while watching the number of backfills and cluster
> > > latency.
> > > The same script is used to gradually drain OSDs down to CRUSH weight 0.
> > > We've used that second script to completely replace several petabytes of
> > > hardware.
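> > > 
> > > Conceptually it is just (a hand sketch, not the script; id and target
> > > weight made up):
> > > 
> > >  # bring a fresh OSD in, 0.01 at a time
> > >  for w in $(seq 0.01 0.01 1.82); do
> > >      ceph osd crush reweight osd.99 $w
> > >      sleep 300   # or poll ceph -s until backfills settle
> > >  done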
> > > 
> > > Cheers, Dan
> > > 
> > > 
> > > On 25 Apr 2017, at 08:22, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> > > 
> > > I read this thread with interest because I’ve been squeezing the OSD
> > > distribution on several clusters myself while expansion gear is in the
> > > pipeline, ending up with an ugly mix of both types of reweight as well as
> > > temporarily raising the full and backfill-full ratios.
> > > 
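> > > For the ratio bumps, on jewel that was along these lines (values made up;
> > > temporary, and reverted once things settle):
> > > 
> > >  ceph pg set_full_ratio 0.97
> > >  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.92'
> > > 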
> > > I’d been contemplating tweaking Dan@CERN’s reweighting script to use CRUSH
> > > reweighting instead, and to squeeze from both ends, though I fear it might
> > > not be as simple as it sounds prima facie.
> > > 
> > > 
> > > Aaron wrote:
> > > 
> > > 
> > > Should I be expecting it to decide to increase some underutilized osds?
> > > 
> > > 
> > > The osd reweight mechanism only accommodates an override weight between 0
> > > and 1, thus it can decrease but not increase a given OSD’s fullness.  To
> > > directly fill up underfull OSDs it would seem to need an override
> > > weight > 1, which isn’t possible.
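> > > 
> > > That is, the only knob available is of the form
> > > 
> > >  ceph osd reweight 7 0.8    # id made up; weight must be in 0..1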
> > > 
> > > I haven’t personally experienced it (yet), but from what I read, if
> > > override-reweighted OSDs get marked out and back in again, their override
> > > will revert to 1.  In a case where a cluster is running close to the full
> > > ratio, it would *seem* that a network glitch etc. might result in
> > > some OSDs filling up and hitting the full threshold, which would be bad.
> > > 
> > > Using CRUSH reweight instead would seem to address both of these
> > > shortcomings, though it does perturb the arbitrary but useful way that
> > > initial CRUSH weights by default reflect the capacity of each OSD.
> > > Various references also indicate that the override reweight does not
> > > change the weight of buckets above the OSD, but that CRUSH reweight does.
> > > I haven’t found any discussion of the ramifications of this, but my initial
> > > stab at it would be that when one does the 0-1 override reweight, the
> > > “extra” data is redistributed to OSDs on the same node.  CRUSH
> > > reweighting would then seem to pull / push the wad of data being adjusted
> > > from / to *other* OSD nodes.  Or it could be that I’m out of my Vulcan
> > > mind.
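> > > 
> > > One way to test the bucket-weight behavior (ids and weights made up):
> > > 
> > >  ceph osd tree                       # note the host row's WEIGHT
> > >  ceph osd crush reweight osd.3 1.5   # host WEIGHT changes
> > >  ceph osd reweight 3 0.8             # only the REWEIGHT column changes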
> > > 
> > > Thus adjusting the weight of a given OSD affects the fullness of other
> > > OSDs, in ways that would seem to differ depending on which method is
> > > used.  As I think you implied in one of your messages, sometimes this can
> > > result in the fullness of one or more OSDs climbing relatively sharply,
> > > even to a point distinctly above where the previous most-full OSDs were.
> > > 
> > > I lurked in the recent developers’ meeting where strategies for A Better
> > > Way in Luminous were discussed.  While the plans are exciting and hold
> > > promise for uniform, and thus greater safe, utilization of a cluster’s raw
> > > space, I suspect that between dev/test time and the attrition needed to
> > > update running clients, those of us running existing RBD clusters won’t be
> > > able to take advantage of them for some time.
> > > 
> > > — Anthony
> > > 
> > > 
> 
> 
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
