On Thu, 20 Apr 2017, Aaron Bassett wrote:
> Ahh nm I got it:
>
> ceph osd test-reweight-by-utilization 110
> no change
> moved 56 / 278144 (0.0201335%)
> avg 259.948
> stddev 15.9527 -> 15.9079 (expected baseline 16.1154)
> min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
> max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)
>
> oload 110
> max_change 0.05
> max_change_osds 4
> average 0.719019
> overload 0.790921
> osd.1038 weight 1.000000 -> 0.950012
> osd.10 weight 1.000000 -> 0.950012
> osd.481 weight 1.000000 -> 0.950012
> osd.613 weight 1.000000 -> 0.950012

You might try walking down from 120 to 110, and changing more than 4 osds
at a time.

> This is only changing the ephemeral weight? Is that going to be an issue
> if I need to apply an update and restart osds?

This is changing the confusingly-named 'osd reweight' value, which is
designed to do exactly this. It won't get clobbered by an osd restart.

sage
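For reference, a minimal sketch of the Jewel-era invocation discussed above. The positional arguments are oload, max_change and max_osds, matching the "oload / max_change / max_change_osds" lines in the output; the specific numbers below are only illustrative, not a recommendation:

    # dry run: report which 'osd reweight' values would change, using a
    # 110% overload cutoff, at most 0.05 of change per OSD, and up to
    # 16 OSDs per pass instead of the default 4
    ceph osd test-reweight-by-utilization 110 0.05 16

    # if the projected data movement (the "moved 56 / 278144" line above)
    # looks acceptable, apply the same change for real
    ceph osd reweight-by-utilization 110 0.05 16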
> Aaron
>
> On Apr 20, 2017, at 11:35 AM, Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> wrote:
>
> On Apr 20, 2017, at 11:27 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 20 Apr 2017, Aaron Bassett wrote:
> Good morning,
> I have a large (1000) osd cluster running Jewel (10.2.6). It's an object
> store cluster, just using RGW with two EC pools of different redundancies.
> Tunables are optimal:
>
> ceph osd crush show-tunables
> {
>     "choose_local_tries": 0,
>     "choose_local_fallback_tries": 0,
>     "choose_total_tries": 50,
>     "chooseleaf_descend_once": 1,
>     "chooseleaf_vary_r": 1,
>     "chooseleaf_stable": 1,
>     "straw_calc_version": 1,
>     "allowed_bucket_algs": 54,
>     "profile": "jewel",
>     "optimal_tunables": 1,
>     "legacy_tunables": 0,
>     "minimum_required_version": "jewel",
>     "require_feature_tunables": 1,
>     "require_feature_tunables2": 1,
>     "has_v2_rules": 1,
>     "require_feature_tunables3": 1,
>     "has_v3_rules": 0,
>     "has_v4_buckets": 0,
>     "require_feature_tunables5": 1,
>     "has_v5_rules": 0
> }
>
> It's about 72% full and I'm starting to hit the dreaded "nearfull"
> warnings. My osd utilizations range from 59% to 85%. My current approach
> has been to use "ceph osd crush reweight" to knock a few points off the
> weight of any osds that are > 84% utilized. I realized I should also
> probably be bumping up the weights of some osds at the low end to help
> direct the data in the right direction, but I have not started doing
> that yet. It's getting a bit complicated as I'm having some I've
> already weighted down pop back up again, so it takes a lot of care to do
> it right and not screw up in a way that would move a lot of data
> unnecessarily, or get into a backfill_toofull situation.
>
> FWIW, in the past on an older cluster running Hammer I believe, I had
> used reweight_by_utilization in this situation. That ended poorly as it
> lowered some of the weights so low that crush was unable to place some
> pgs, leading me to a lengthy process of manually correcting. Also this
> cluster is much larger than that one was and I'm hesitant to try to
> shuffle so much data at once.
>
> That problem has been fixed; I'd try the new jewel version.
>
> This is the output of ceph osd test-reweight-by-utilization:
> no change
> moved 0 / 278144 (0%)
> avg 259.948
> stddev 15.9527 -> 15.9527 (expected baseline 16.1154)
> min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
> max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)
>
> oload 120
> max_change 0.05
> max_change_osds 4
> average 0.719013
> overload 0.862816
>
> ...and I'm guessing that this isn't doing anything because the default
> oload value of 120 is too high for you. Try setting that to 110 and
> re-running test-reweight-by-utilization to see what it will do.
>
> Google is failing me on oload, are there docs you can point me at?
>
> So just wondering if anyone has any advice for me here, or if I should
> carry on as is. I would like to get overall utilization up to at least
> 80% before calling it full and moving on to another, as with a cluster
> this size, those last few percent represent quite a lot of space.
>
> Note that in luminous we have a few mechanisms in place that will let you
> get to an essentially perfect distribution (yay, finally!) so this is a
> short-term problem to get through... at least until you can get all
> clients for the cluster using luminous as well. Since this is an rgw
> cluster that shouldn't be a problem for you!
>
> That's great to hear, I'm hoping to do the next cluster on
> Luminous/Bluestore, but it's going to depend how long I can keep
> shoveling data into this one!
>
> sage
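On the 'oload' question in the quoted thread above: the output itself shows the relationship. oload is the overload threshold expressed as a percentage of the mean utilization, and OSDs above that cutoff are the ones that get their 'osd reweight' lowered. A short worked check against the numbers above, plus a sketch of the two weight commands being contrasted in this thread (osd.1038 and the weight values are illustrative only):

    # overload cutoff = average utilization * oload / 100
    #   oload 120: 0.719013 * 1.20 = 0.862816  -> no OSD above it, hence "no change"
    #   oload 110: 0.719019 * 1.10 = 0.790921  -> four OSDs above it get weighted down

    # 'osd reweight' is the 0..1 override that reweight-by-utilization adjusts;
    # per the reply above, it is not clobbered by an osd restart
    ceph osd reweight 1038 0.95

    # 'osd crush reweight' changes the CRUSH weight itself (typically the disk
    # size in TiB) and is what the manual tuning described earlier was using
    ceph osd crush reweight osd.1038 1.75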
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com