> On 16 November 2017 at 15:20, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 16 Nov 2017, Rafał Wądołowski wrote:
> > Sage,
> >
> > you wrote about an 'automatic balancer module' -- what do you mean? Could you tell us more, or paste some hyperlinks?
>
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/balancer/module.py
>
> There will be a blog post as soon as 12.2.2 is out. Basically you can do
>
>   ceph balancer mode crush-compat
>   ceph balancer on
>
> and walk away.

That seems very nice! It would really make life a lot easier.

Awesome!

Wido

> sage
>
> >
> > BR,
> >
> > Rafał Wądołowski
> >
> > http://cloudferro.com/
> >
> > On 16.11.2017 14:08, Sage Weil wrote:
> > > On Thu, 16 Nov 2017, Pavan Rallabhandi wrote:
> > > > Had to revive this old thread; I had a couple of questions.
> > > >
> > > > Since `ceph osd reweight-by-utilization` changes the override weights of the OSDs but not their CRUSH weights, is it still a problem (the override weights being reset to 1) if those reweighted OSDs go OUT of the cluster and later get marked IN?
> > > >
> > > > I thought that since OSD override weights are not persistent across OUT/IN cycles, using `ceph osd reweight` or `ceph osd reweight-by-utilization` is not a lasting solution.
> > >
> > > This was fixed a while ago, and a superficial check of the jewel code indicates that the in/out values are persistent now. Have you observed them getting reset with jewel?
> > >
> > > > We are having balancing issues on one of our Jewel clusters, and I wanted to understand the pros of using `ceph osd reweight-by-utilization` over `ceph osd crush reweight`.
> > >
> > > Both will get the job done, but I would stick with reweight-by-utilization, as it keeps the real CRUSH weight matched to the device size. Once you move to Luminous, it will be an easier transition to the automatic balancer module (which handles all of this for you).
> > >
> > > sage
> > >
> > > > Thanks,
> > > > -Pavan.
> > > >
> > > > From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx> on behalf of Dan Van Der Ster <daniel.vanderster@xxxxxxx>
> > > > Date: Tuesday, 25 April 2017 at 11:36 PM
> > > > To: Anthony D'Atri <aad@xxxxxxxxxxxxxx>
> > > > Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>
> > > > Subject: EXT: Re: Rebalancing
> > > >
> > > > We run this continuously -- in a cron job every 2 hours -- on all of our clusters:
> > > > https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
> > > >
> > > > The name is a misnomer, yes -- my original plan was indeed to modify CRUSH weights, but for some reason I no longer recall I switched it to modify the override reweights. It should be super easy to change the CRUSH weight instead.
> > > >
> > > > We run it with params to change the weights of only 4 OSDs by 0.01 at a time. This ever so gradually flattens the PG distribution and is totally transparent latency-wise.
> > > >
> > > > BTW, it supports reweighting only below certain CRUSH buckets, which is essential if you have a non-uniform OSD tree.
> > > >
> > > > For adding in new hardware, we use this script:
> > > > https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
> > > >
> > > > New OSDs start with CRUSH weight 0, then we gradually increase the weights 0.01 at a time, all the while watching the number of backfills and cluster latency. The same script is used to gradually drain OSDs down to CRUSH weight 0. We've used that second script to completely replace several petabytes of hardware.
> > > >
> > > > Cheers, Dan
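[For illustration only -- a minimal sketch of the gradual ramp-up Dan describes, not the actual ceph-gentle-reweight script. The OSD id, target weight, step size, and sleep interval are made-up placeholders:

    # bring a new OSD (added with CRUSH weight 0) up to its full weight in 0.01 steps
    for w in $(seq 0.01 0.01 1.82); do
        ceph osd crush reweight osd.42 "$w"
        sleep 600   # crude stand-in; the real script checks backfill counts and latency instead of a fixed sleep
    done

Draining an OSD would be the same loop run in reverse, stepping the CRUSH weight back down toward 0.]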
> > > > On 25 Apr 2017, at 08:22, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > I read this thread with interest because I've been squeezing the OSD distribution on several clusters myself while expansion gear is in the pipeline, ending up with an ugly mix of both types of reweight as well as temporarily raising the full and backfill-full ratios.
> > > >
> > > > I'd been contemplating tweaking Dan@CERN's reweighting script to use CRUSH reweighting instead, and to squeeze from both ends, though I fear it might not be as simple as it sounds prima facie.
> > > >
> > > > Aaron wrote:
> > > >
> > > > Should I be expecting it to decide to increase some underutilized osds?
> > > >
> > > > The osd reweight mechanism only accommodates an override weight between 0 and 1, so it can decrease but not increase a given OSD's fullness. To directly fill up underfull OSDs it would seem to need an override weight > 1, which isn't possible.
> > > >
> > > > I haven't personally experienced it (yet), but from what I read, if override-reweighted OSDs get marked out and back in again, their override reverts to 1. In a cluster running close to the full ratio, this would *seem* to mean that a network glitch or similar could result in some OSDs filling up and hitting the full threshold, which would be bad.
> > > >
> > > > Using CRUSH reweight instead would seem to address both of these shortcomings, though it does perturb the arbitrary but useful way that initial CRUSH weights by default reflect the capacity of each OSD. Various references also indicate that the override reweight does not change the weight of buckets above the OSD, but that CRUSH reweight does. I haven't found any discussion of the ramifications of this, but my initial stab at it would be that when one does the 0-1 override reweight, the "extra" data is redistributed to OSDs on the same node, whereas CRUSH reweighting would pull / push the wad of data being adjusted from / to *other* OSD nodes. Or it could be that I'm out of my Vulcan mind.
> > > >
> > > > Thus adjusting the weight of a given OSD affects the fullness of other OSDs, in ways that seem to differ depending on which method is used. As I think you implied in one of your messages, this can sometimes result in the fullness of one or more OSDs climbing relatively sharply, even to a point distinctly above where the previously most-full OSDs were.
> > > >
> > > > I lurked in the recent developers' meeting where strategies for A Better Way in Luminous were discussed. While the plans are exciting and hold promise for uniform, and thus greater, safe utilization of a cluster's raw space, I suspect that between dev/test time and the attrition needed to update running clients, those of us running existing RBD clusters won't be able to take advantage of them for some time.
> > > >
> > > > -- Anthony
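[For illustration only -- the two weight mechanisms Anthony contrasts above, with a made-up OSD id and weight values. The override reweight is a temporary 0..1 factor applied on top of the CRUSH weight, so it can only shed data from an OSD; the CRUSH weight is persistent, by convention equal to the device capacity in TiB, and changing it also changes the weights of the buckets above the OSD:

    ceph osd reweight 42 0.95             # override reweight: 0..1 factor, pushes data off osd.42
    ceph osd crush reweight osd.42 1.75   # CRUSH reweight: persistent weight, can move data onto or off osd.42
]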
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com