On Thu, 20 Apr 2017, Aaron Bassett wrote:
> Good morning,
> I have a large (1000 osd) cluster running Jewel (10.2.6). It's an object
> store cluster, just using RGW with two EC pools of different redundancies.
> Tunables are optimal:
>
> ceph osd crush show-tunables
> {
>     "choose_local_tries": 0,
>     "choose_local_fallback_tries": 0,
>     "choose_total_tries": 50,
>     "chooseleaf_descend_once": 1,
>     "chooseleaf_vary_r": 1,
>     "chooseleaf_stable": 1,
>     "straw_calc_version": 1,
>     "allowed_bucket_algs": 54,
>     "profile": "jewel",
>     "optimal_tunables": 1,
>     "legacy_tunables": 0,
>     "minimum_required_version": "jewel",
>     "require_feature_tunables": 1,
>     "require_feature_tunables2": 1,
>     "has_v2_rules": 1,
>     "require_feature_tunables3": 1,
>     "has_v3_rules": 0,
>     "has_v4_buckets": 0,
>     "require_feature_tunables5": 1,
>     "has_v5_rules": 0
> }
>
> It's about 72% full and I'm starting to hit the dreaded "nearfull"
> warnings. My osd utilizations range from 59% to 85%. My current approach
> has been to use "ceph osd crush reweight" to knock a few points off the
> weight of any osds that are > 84% utilized. I realize I should also
> probably be bumping up the weights of some osds at the low end to help
> direct the data in the right direction, but I have not started doing
> that yet. It's getting a bit complicated, as some osds I've already
> weighted down pop back up again, so it takes a lot of care to do it
> right and not screw up in a way that would move a lot of data
> unnecessarily, or get into a backfill_toofull situation.
>
> FWIW, in the past on an older cluster running Hammer, I believe, I had
> used reweight_by_utilization in this situation. That ended poorly, as it
> lowered some of the weights so low that CRUSH was unable to place some
> pgs, leading me to a lengthy process of manual correction. Also, this
> cluster is much larger than that one was, and I'm hesitant to shuffle
> so much data at once.

That problem has been fixed; I'd try the new jewel version.

> This is the output of ceph osd test-reweight-by-utilization:
>
> no change
> moved 0 / 278144 (0%)
> avg 259.948
> stddev 15.9527 -> 15.9527 (expected baseline 16.1154)
> min osd.512 with 217 -> 217 pgs (0.834783 -> 0.834783 * mean)
> max osd.870 with 314 -> 314 pgs (1.20794 -> 1.20794 * mean)
>
> oload 120
> max_change 0.05
> max_change_osds 4
> average 0.719013
> overload 0.862816

...and I'm guessing that this isn't doing anything because the default
oload value of 120 is too high for you. Try setting that to 110 and
re-running test-reweight-by-utilization to see what it will do.

> So I'm just wondering if anyone has any advice for me here, or if I
> should carry on as is. I would like to get overall utilization up to at
> least 80% before calling it full and moving on to another cluster, as
> with a cluster this size, those last few percent represent quite a lot
> of space.

Note that in luminous we have a few mechanisms in place that will let you
get to an essentially perfect distribution (yay, finally!), so this is a
short-term problem to get through... at least until you can get all
clients for the cluster using luminous as well. Since this is an rgw
cluster, that shouldn't be a problem for you!

sage
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com
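
For reference, the two approaches discussed above look roughly like the
sketch below. The osd ids and weight values are purely illustrative (not
taken from the cluster in the thread); the oload/max_change/max_osds
arguments mirror the defaults shown in the test-reweight-by-utilization
output, with oload lowered to 110 as suggested.

# Manual approach: shave a little CRUSH weight off an overfull osd and
# give a little back to an underfull one (weights here are made up).
ceph osd crush reweight osd.870 3.5
ceph osd crush reweight osd.512 3.7

# Jewel reweight-by-utilization: dry-run first with a lower overload
# threshold (110 instead of the default 120), capping the change per osd
# at 0.05 and touching at most 4 osds per run.
ceph osd test-reweight-by-utilization 110 0.05 4

# If the proposed moves look sane, apply them and check the new spread.
ceph osd reweight-by-utilization 110 0.05 4
ceph osd df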