Re: Rebalancing


Having just stood up a new 8PB cluster on luminous/bluestore, this makes me very happy!

Aaron

> On Nov 16, 2017, at 9:20 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 16 Nov 2017, Rafał Wądołowski wrote:
>> Sage,
>>
>> you wrote about an 'automatic balancer module'; what do you mean? Could
>> you tell us more, or paste hyperlinks?
>
> https://github.com/ceph/ceph/blob/master/src/pybind/mgr/balancer/module.py
>
> There will be a blog post as soon as 12.2.2 is out.  Basically you can do
>
> ceph balancer mode crush-compat
> ceph balancer on
>
> and walk away.
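Once enabled, the balancer can be inspected with its status and scoring commands (luminous CLI; a sketch, output will vary by cluster):

```shell
# Check whether the balancer is active and which mode it is in.
ceph balancer status

# Score the current PG distribution; lower is better. Useful before and
# after enabling, to see whether it is converging.
ceph balancer eval
```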
>
> sage
>
>
>>
>> BR,
>>
>> Rafał Wądołowski
>>
>>
>> http://cloudferro.com/
>> On 16.11.2017 14:08, Sage Weil wrote:
>>> On Thu, 16 Nov 2017, Pavan Rallabhandi wrote:
>>>> Had to revive this old thread; I had a couple of questions.
>>>>
>>>> Since `ceph osd reweight-by-utilization` is changing the weights of the
>>>> OSDs but not the CRUSH weights, is it still not a problem (as the OSD
>>>> weights would be reset to 1) if those reweighted OSDs go OUT of the
>>>> cluster and later get marked IN?
>>>>
>>>> I thought since OSD weights are not persistent across OUT/IN cycles, it
>>>> is not a lasting solution to use `ceph osd reweight` or `ceph osd
>>>> reweight-by-utilization`.
>>> This was fixed a while ago, and a superficial check of the jewel code
>>> indicates that the in/out values are persistent now.  Have you observed
>>> them getting reset with jewel?
>>>
>>>> We are having balancing issues on one of our Jewel clusters and I wanted
>>>> to understand the pros of using `ceph osd reweight-by-utilization` over
>>>> `ceph osd crush reweight`.
>>> Both will get the job done, but I would stick with reweight-by-utilization
>>> as it keeps the real CRUSH weight matched to the device size.  Once you
>>> move to luminous, it will be an easier transition to the automatic
>>> balancer module (which handles all of this for you).
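On jewel there is also a dry-run variant, which makes iterating safer; the threshold value below is illustrative (percent of mean utilization above which an OSD gets touched):

```shell
# Report what reweight-by-utilization would change, without applying it.
ceph osd test-reweight-by-utilization 110

# Apply the same adjustment for real.
ceph osd reweight-by-utilization 110
```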
>>>
>>> sage
>>>
>>>
>>>> Thanks,
>>>> -Pavan.
>>>>
>>>> From: Ceph-large <ceph-large-bounces@xxxxxxxxxxxxxx> on behalf of Dan Van
>>>> Der Ster <daniel.vanderster@xxxxxxx>
>>>> Date: Tuesday, 25 April 2017 at 11:36 PM
>>>> To: Anthony D'Atri <aad@xxxxxxxxxxxxxx>
>>>> Cc: "ceph-large@xxxxxxxxxxxxxx" <ceph-large@xxxxxxxxxxxxxx>
>>>> Subject: EXT: Re:  Rebalancing
>>>>
>>>> We run this continuously -- in a cron every 2 hours -- on all of our
>>>> clusters:
>>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
>>>> It's a misnomer, yes -- my original plan was indeed to modify CRUSH
>>>> weights, but for some reason I no longer recall I switched it to
>>>> modify the override reweights. It should be super easy to change the
>>>> crush weight instead.
>>>> We run it with params to change weights of only 4 OSDs by 0.01 at a time.
>>>> This ever so gradually flattens the PG distribution, and is totally
>>>> transparent latency-wise.
>>>> BTW, it supports reweighting only below certain CRUSH buckets, which is
>>>> essential if you have a non-uniform OSD tree.
>>>>
>>>> For adding in new hardware, we use this script:
>>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
>>>> New OSDs start with crush weight 0, then we gradually increase the weights
>>>> 0.01 at a time, all the while watching the number of backfills and cluster
>>>> latency.
>>>> The same script is used to gradually drain OSDs down to CRUSH weight 0.
>>>> We've used that second script to completely replace several petabytes of
>>>> hardware.
>>>>
>>>> Cheers, Dan
>>>>
>>>>
>>>> On 25 Apr 2017, at 08:22, Anthony D'Atri
>>>> <aad@xxxxxxxxxxxxxx<mailto:aad@xxxxxxxxxxxxxx>> wrote:
>>>>
>>>> I read this thread with interest because I’ve been squeezing the OSD
>>>> distribution on several clusters myself while expansion gear is in the
>>>> pipeline, ending up with an ugly mix of both types of reweight as well
>>>> as temporarily raising the full and backfill-full ratios.
>>>>
>>>> I’d been contemplating tweaking Dan@CERN’s reweighting script to use
>>>> CRUSH reweighting instead, and to squeeze from both ends, though I
>>>> fear it might not be as simple as it sounds prima facie.
>>>>
>>>>
>>>> Aaron wrote:
>>>>
>>>>
>>>> Should I be expecting it to decide to increase some underutilized osds?
>>>>
>>>>
>>>> The osd reweight mechanism only accommodates an override weight
>>>> between 0 and 1, thus it can decrease but not increase a given OSD’s
>>>> fullness.  To directly fill up underfull OSDs it would seem to need an
>>>> override weight > 1, which isn’t possible.
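A concrete illustration of that asymmetry (OSD id and weights made up):

```shell
# Override reweight takes a fraction in [0, 1]; 0.85 diverts roughly 15%
# of the PGs that would map to osd.7 elsewhere. Values above 1 are
# rejected, so this knob can only unload an OSD, never fill one.
ceph osd reweight osd.7 0.85

# CRUSH reweight takes an absolute weight (by convention the capacity in
# TiB), and can be raised as well as lowered.
ceph osd crush reweight osd.7 5.46
```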
>>>>
>>>> I haven’t personally experienced it (yet), but from what I read, if
>>>> override-reweighted OSDs get marked out and back in again, their
>>>> override will revert to 1.  In a cluster running close to the full
>>>> ratio, it would *seem* that a network glitch etc. might then result in
>>>> some OSDs filling up and hitting the full threshold, which would be
>>>> bad.
>>>>
>>>> Using CRUSH reweight instead would seem to address both of these
>>>> shortcomings, though it does perturb the arbitrary but useful way that
>>>> initial CRUSH weights by default reflect the capacity of each OSD.
>>>> Various references also indicate that the override reweight does not
>>>> change the weight of buckets above the OSD, but that CRUSH reweight
>>>> does.  I haven’t found any discussion of the ramifications of this,
>>>> but my initial stab at it would be that when one does the 0-1 override
>>>> reweight, the “extra” data is redistributed to OSDs on the same node.
>>>> CRUSH reweighting would then seem to pull / push the wad of data being
>>>> adjusted from / to *other* OSD nodes.  Or it could be that I’m out of
>>>> my Vulcan mind.
>>>>
>>>> Thus adjusting the weight of a given OSD affects the fullness of
>>>> other OSDs, in ways that would seem to differ depending on which
>>>> method is used.  As I think you implied in one of your messages,
>>>> sometimes this can result in the fullness of one or more OSDs climbing
>>>> relatively sharply, even to a point distinctly above where the
>>>> previous most-full OSDs were.
>>>>
>>>> I lurked in the recent developers’ meeting where strategies for A
>>>> Better Way in Luminous were discussed.  While the plans are exciting
>>>> and hold promise for uniform, and thus greater, safe utilization of a
>>>> cluster’s raw space, I suspect that between dev/test time and the
>>>> attrition needed to update running clients, those of us running
>>>> existing RBD clusters won’t be able to take advantage of them for some
>>>> time.
>>>>
>>>> — Anthony
>>>>
>>>>
>>>> _______________________________________________
>>>> Ceph-large mailing list
>>>> Ceph-large@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
>>>>
>>>>
>>>>
>>>>
>>




