Re: Rebalancing

Wow awesome! I'd love to take a look at the script. I actually have two pools with the bulk of the data, as they have different data safety requirements:

NAME                             ID     USED     %USED     MAX AVAIL       OBJECTS
.rgw.buckets                     59     2431T     75.31          797T     641506391
.rgw.buckets.stor-lite           58     1157T     58.08          835T     305047640

Everything else is just the various rgw metadata pools.

Aaron 

On Apr 20, 2017, at 12:06 PM, David Turner <drakonstein@xxxxxxxxx> wrote:

... That moment when you're writing up a response and Sage beats you to the punch.  It's very exciting that Luminous/Bluestore is potentially going to resolve this for good.  I haven't had the best of luck with the built-in test-reweight-by-utilization and opted for external scripting to attain the best OSD distribution.  I've had clusters with over 1,000 OSDs that were 75% full end up balanced to within 2% between the most-used and least-used OSD.

I do this by balancing the CRUSH map instead of trying to balance the live cluster.  What that means is that the script takes an offline copy of the map and, using osdmaptool and crushtool, adjusts the weights until the number of PGs on each OSD is balanced.  Once you have a balanced map, you can upload the new CRUSH map into the cluster, and with one map update your cluster will rebalance to an optimal state.
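
For reference, the offline workflow being described looks roughly like this (a sketch, not David's actual script; pool id 59 is the .rgw.buckets pool from the ceph df output above, and the weight edits in the middle step are the part the script automates):

  # grab the current osdmap and pull the CRUSH map out of it
  ceph osd getmap -o osdmap.bin
  osdmaptool osdmap.bin --export-crush crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # edit the item weights in crushmap.txt, recompile, and check the
  # resulting PG-per-OSD distribution entirely offline
  crushtool -c crushmap.txt -o crushmap.new
  osdmaptool osdmap.bin --import-crush crushmap.new
  osdmaptool osdmap.bin --test-map-pgs --pool 59

  # once the spread looks good, push the new map in a single update
  ceph osd setcrushmap -i crushmap.new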

I have 2 different versions of the tool.  The optimal version requires that you have one pool holding the vast majority of the cluster's data.  The second version will do its best to balance the cluster with however many pools you have and whatever percentage of the data each holds.

I can email you the script with a rough guide on how to use it, or you can send me a copy of your crushmap and the output of a couple commands for me to tie it into the tool and send you back a map to test.

On Thu, Apr 20, 2017 at 11:52 AM Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> wrote:

> On Apr 20, 2017, at 11:44 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 20 Apr 2017, Aaron Bassett wrote:
>> Ahh nm I got it:  ceph osd test-reweight-by-utilization 110
>> no change
>> moved 56 / 278144 (0.0201335%)
>> avg 259.948
>> stddev 15.9527 -> 15.9079 (expected baseline 16.1154)
>> min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
>> max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)
>>
>> oload 110
>> max_change 0.05
>> max_change_osds 4
>> average 0.719019
>> overload 0.790921
>> osd.1038 weight 1.000000 -> 0.950012
>> osd.10 weight 1.000000 -> 0.950012
>> osd.481 weight 1.000000 -> 0.950012
>> osd.613 weight 1.000000 -> 0.950012
>
> You might try walking down from 120 to 110, and changing more than 4 osds
> at a time.
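
For reference, the dry-run form takes oload, max_change, and max_change_osds in that order, so "walking down" while changing more OSDs at a time looks like:

  ceph osd test-reweight-by-utilization 115 0.05 10
  ceph osd test-reweight-by-utilization 110 0.05 20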

What should I be looking for as a good action to take? It looks like it just wants to do something very similar to what I would have done:

ceph osd df | sort -k7 -n | tail -20
(columns: ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS)
 714 7.27100  1.00000 7446G 6108G 1337G 82.04 1.14 283
 601 7.27100  1.00000 7446G 6109G 1336G 82.05 1.14 297
 916 7.27100  1.00000 7446G 6135G 1310G 82.40 1.15 295
 116 6.89999  1.00000 7446G 6139G 1306G 82.46 1.15 278
1097 7.27100  1.00000 7446G 6142G 1303G 82.49 1.15 289
1029 7.27100  1.00000 7446G 6150G 1295G 82.60 1.15 292
  75 7.27100  1.00000 7446G 6165G 1280G 82.81 1.15 294
 490 7.27100  1.00000 7446G 6169G 1276G 82.86 1.15 293
 919 7.27100  1.00000 7446G 6172G 1273G 82.90 1.15 293
 502 7.27100  1.00000 7446G 6183G 1262G 83.05 1.16 293
 310 7.27100  1.00000 7446G 6184G 1261G 83.06 1.16 303
1011 7.27100  1.00000 7446G 6204G 1241G 83.33 1.16 297
 910 7.27100  1.00000 7446G 6205G 1240G 83.34 1.16 301
 678 7.27100  1.00000 7446G 6223G 1222G 83.59 1.16 285
 853 7.27100  1.00000 7446G 6234G 1211G 83.72 1.16 286
 498 7.27100  1.00000 7446G 6241G 1204G 83.82 1.17 294
 613 7.27100  1.00000 7446G 6246G 1199G 83.90 1.17 275
 481 7.27100  1.00000 7446G 6268G 1177G 84.19 1.17 293
  10 7.27100  1.00000 7446G 6296G 1149G 84.56 1.18 297
1038 6.79999  1.00000 7446G 6319G 1126G 84.87 1.18 281
root@phx-r2-r1-head1:~# ceph osd test-reweight-by-utilization 110 0.05 20
no change
moved 302 / 278144 (0.108577%)
avg 259.948
stddev 15.9527 -> 15.6665 (expected baseline 16.1154)
min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)

oload 110
max_change 0.05
max_change_osds 20
average 0.719026
overload 0.790929
osd.1038 weight 1.000000 -> 0.950012
osd.10 weight 1.000000 -> 0.950012
osd.481 weight 1.000000 -> 0.950012
osd.613 weight 1.000000 -> 0.950012
osd.498 weight 1.000000 -> 0.950012
osd.678 weight 1.000000 -> 0.950012
osd.502 weight 1.000000 -> 0.950012
osd.490 weight 1.000000 -> 0.950012
osd.1029 weight 1.000000 -> 0.950012
osd.1097 weight 1.000000 -> 0.950012
osd.116 weight 1.000000 -> 0.950012
osd.601 weight 1.000000 -> 0.950012
osd.714 weight 1.000000 -> 0.950012
osd.60 weight 1.000000 -> 0.950012
osd.503 weight 1.000000 -> 0.950012
osd.689 weight 1.000000 -> 0.950012
osd.446 weight 1.000000 -> 0.950012
osd.508 weight 1.000000 -> 0.950012
osd.506 weight 1.000000 -> 0.950012
osd.374 weight 1.000000 -> 0.950012


Should I be expecting it to decide to increase some underutilized osds?

Aaron

>
>> This is only changing the ephemeral weight? Is that going to be an issue if
>> I need to apply an update and restart osds?
>
> This is changing the confusingly-named 'osd reweight' value, which is
> designed to do exactly this.  It won't get clobbered by an osd restart.
>
> sage
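
For clarity, the two weights being contrasted here map to two different commands (the values are illustrative, taken from the osd df output above):

  ceph osd reweight 1038 0.950012          # the 0-1 REWEIGHT value that reweight-by-utilization adjusts; not clobbered by restarts
  ceph osd crush reweight osd.1038 6.8     # the CRUSH item weight Aaron had been lowering by hand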
>
>
>> Aaron
>>
>>      On Apr 20, 2017, at 11:35 AM, Aaron Bassett <Aaron.Bassett@xxxxxxxxxxxxx> wrote:
>>
>>      On Apr 20, 2017, at 11:27 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> On Thu, 20 Apr 2017, Aaron Bassett wrote:
>>      Good morning,
>>      I have a large (1000 osd) cluster running Jewel (10.2.6). It's an object store
>>      cluster, just using RGW with two EC pools of different redundancies. Tunables
>>      are optimal:
>>
>>      ceph osd crush show-tunables
>>      {
>>         "choose_local_tries": 0,
>>         "choose_local_fallback_tries": 0,
>>         "choose_total_tries": 50,
>>         "chooseleaf_descend_once": 1,
>>         "chooseleaf_vary_r": 1,
>>         "chooseleaf_stable": 1,
>>         "straw_calc_version": 1,
>>         "allowed_bucket_algs": 54,
>>         "profile": "jewel",
>>         "optimal_tunables": 1,
>>         "legacy_tunables": 0,
>>         "minimum_required_version": "jewel",
>>         "require_feature_tunables": 1,
>>         "require_feature_tunables2": 1,
>>         "has_v2_rules": 1,
>>         "require_feature_tunables3": 1,
>>         "has_v3_rules": 0,
>>         "has_v4_buckets": 0,
>>         "require_feature_tunables5": 1,
>>         "has_v5_rules": 0
>>      }
>>
>>
>>      It's about 72% full and I'm starting to hit the dreaded "nearfull" warnings. My osd
>>      utilizations range from 59% to 85%. My current approach has been to use "ceph osd
>>      crush reweight" to knock a few points off the weight of any osds that are > 84%
>>      utilized. I realized I should also probably be bumping up the weights of some osds
>>      at the low end to help direct the data in the right direction, but I have not
>>      started doing that yet. It's getting a bit complicated, as some osds I've already
>>      weighted down pop back up again, so it takes a lot of care to do it right and not
>>      screw up in a way that would move a lot of data unnecessarily, or get into a
>>      backfill_toofull situation.
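
Concretely, that by-hand approach is along the lines of the commands below; the target weights are illustrative, roughly matching the already-lowered weights visible in the ceph osd df output quoted above:

  ceph osd crush reweight osd.1038 6.8   # ~85% utilized, down from 7.271
  ceph osd crush reweight osd.116 6.9    # ~82% utilized, down from 7.271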
>>
>>      FWIW, in the past on an older cluster (running Hammer, I believe) I had used
>>      reweight_by_utilization in this situation. That ended poorly, as it lowered some of
>>      the weights so low that crush was unable to place some pgs, leading me to a lengthy
>>      process of manual correction. Also, this cluster is much larger than that one was,
>>      and I'm hesitant to shuffle so much data at once.
>>
>>
>> That problem has been fixed; I'd try the new jewel version.
>>
>>      This is the output of ceph osd test-reweight-by-utilization:
>>      no change
>>      moved 0 / 278144 (0%)
>>      avg 259.948
>>      stddev 15.9527 -> 15.9527 (expected baseline 16.1154)
>>      min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
>>      max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)
>>
>>      oload 120
>>      max_change 0.05
>>      max_change_osds 4
>>      average 0.719013
>>      overload 0.862816
>>
>>
>> ...and I'm guessing that this isn't doing anything because the default oload value of 120
>> is too high for you.  Try setting that to 110 and re-running test-reweight-by-utilization
>> to see what it will do.
>>
>>
>> Google is failing me on oload, are there docs you can point me at?
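
(A note grounded in the numbers above: the overload cutoff works out to the average utilization times oload/100, e.g. 0.719019 x 1.10 = 0.790921 in the oload 110 run, and only OSDs above that utilization are candidates for a weight reduction.)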
>>
>>
>>            So just wondering if anyone has any advice for me here, or if I should carry
>>            on as is. I would like to get overall utilization up to at least 80% before
>>            calling it full and moving on to another, as with a cluster this size, those
>>            last few percent represent quite a lot of space.
>>
>>      Note that in luminous we have a few mechanisms in place that will let you get to an
>>      essentially perfect distribution (yay, finally!) so this is a short-term problem to
>>      get through... at least until you can get all clients for the cluster using luminous
>>      as well.  Since this is an rgw cluster that shouldn't be a problem for you!
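
For readers finding this later: the luminous mechanisms alluded to here presumably include pg-upmap and the mgr balancer module. Once every client can speak luminous, enabling them looks roughly like this (a sketch, not part of this thread):

  ceph osd set-require-min-compat-client luminous
  ceph mgr module enable balancer
  ceph balancer mode upmap
  ceph balancer on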
>>
>> That's great to hear. I'm hoping to do the next cluster on Luminous/Bluestore, but it's
>> going to depend on how long I can keep shoveling data into this one!
>>
>>
>>
>>
>>      sage
>>
>>
>>


_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
