On Wed, 2 Mar 2016, Dan van der Ster wrote:
> > I'm mostly happy with it, but there are a couple annoying problems:
> >
> > 1) Adding the --no-increasing option (which prevents us from
> > weighting things up) effectively breaks the cli parsing for the optional
> > pool arguments for reweight-by-pg.  I'm not sure which is more useful; I
> > don't think I'd use either one.
> >
>
> The use case for --no-increasing is to have a way to immediately free
> space from the fullest OSDs... useful in a crisis situation I suppose.
> Though, in my experience there are not often very many underweighted
> PGs... and maybe this option would cause a sort of race to 0.0 over
> time. An alternative would be less aggressive at increasing weights:
>
> -  // but aggressively adjust weights up whenever possible.
> -  double underload_util = average_util;
> +  // adjust up only if we are below the threshold
> +  double underload_util = average_util - (overload_util - average_util);
>
> Maybe even add a configurable osd_reweight_underload_factor on the
> (overload_util - average_util) term to make that optional. This is all
> fine-tuning... not sure how essential it is.

I think it's important to keep this weight-back-up bias in there in the
general case.  Even as things stand it tends to weight everything down
over time, so that a sort of inverse normal distribution of weights sits
under 1.0.

The risk you're worried about is that we'll weight OSD A up, pushing some
PG(s) to nearly-full OSD B, at the same time we weight OSD B down, and
we'll be copying data *to* B while also copying data away from B.  In
theory, the backfill reservation isn't supposed to let that happen if OSD
B is approaching full.  But it depends on exactly what the full
thresholds are configured to be...
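For concreteness, a minimal sketch of how the thresholds could look with
that factor folded in.  This is illustrative only: osd_reweight_underload_factor
is just the name proposed above (it is not an existing option), the default
is made up, and the helper is not the actual OSDMonitor.cc code:

  #include <algorithm>

  // Hypothetical knob, name taken from the proposal above.  0.0 reproduces
  // the current aggressive behaviour (underload_util == average_util);
  // 1.0 mirrors the overload margin below the average, as in the diff above.
  static double osd_reweight_underload_factor = 1.0;

  // oload is the usual percentage argument (e.g. 120); average_util is the
  // mean OSD utilization.
  static void compute_thresholds(double average_util, double oload,
                                 double *overload_util,
                                 double *underload_util)
  {
    *overload_util = average_util * oload / 100.0;
    *underload_util = average_util -
      osd_reweight_underload_factor * (*overload_util - average_util);
    // don't let the underload threshold go negative
    *underload_util = std::max(0.0, *underload_util);
  }

A large enough factor pushes the clamped underload threshold to 0, which
would in effect behave like --no-increasing, since no OSD would ever fall
below it.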
Anyway, I think I have a compromise: have the --no-increasing option for
[test-]reweight-by-utilization (which doesn't have a pool argument), and
don't have it for [test-]reweight-by-pg.  I don't think people will be
using the pg variant on a full cluster in an emergency anyway... the
utilization variant makes more sense there.

sage

> On the other hand, pool args on reweight-by-pgs has two use-cases in
> my experience: I usually use pool args because I know that only a
> couple (out of say 15) pools have all the data -- it's not worth it
> (and possibly ineffective) to balance the empty PGs around.
>
> The other (and probably more important) use-case for pool args is when
> you have a non-uniform crush tree -- e.g. some part of the tree should
> by-design get more PGs than other parts of the tree. This second
> use-case would be better served by an option to reweight-by-pgs to
> only reweight OSDs beneath a given crush bucket. But I thought that
> might be a challenge to implement, and anyway not sure if replacing a
> pool arg with a bucket arg solves the cli parsing problem. If it is
> however somehow doable, then we could probably do away with the pool
> args.
>
> > 2) The stats are all based on pg counts.  It might be possible to estimate
> > new stats using the pgmap and estimating storage overhead and a bunch
> > of other ugly hackery so that the reweight-by-utilization case would show
> > stats in terms of bytes, but it'd be a lot of work, and I don't think it's
> > worth it.  That means that even though the reweight-by-utilization
> > adjustments are done based on actual osd utilizations, the
> > before/after stats it shows are in terms of pgs.
>
> I think PG counts are OK.
>
> > Thoughts?
>
> Thanks!
> dan
>
>
> > sage
> >
> >
> > On Tue, 1 Mar 2016, Dan van der Ster wrote:
> >
> >> Hi Sage,
> >> I have a wip here: https://github.com/cernceph/ceph/commits/wip-reweight
> >> I still need to spin up a small test cluster to test it.
> >> -- Dan
> >>
> >> On Mon, Jan 18, 2016 at 4:25 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> > On Mon, Jan 18, 2016 at 3:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >> >> On Mon, 18 Jan 2016, Dan van der Ster wrote:
> >> >>> Hi,
> >> >>>
> >> >>> I'd like to propose a few changes to reweight-by-utilization which
> >> >>> will make it significantly less scary:
> >> >>>
> >> >>> 1. Change reweight-by-utilization to run in "dry run" -- display only
> >> >>> -- mode unless an admin runs with --yes-i-really-really-mean-it. This
> >> >>> way admins can see what will be reweighted before committing to any
> >> >>> changes.
> >> >>
> >> >> I think this piece is key, and there is a lot we might do here to make
> >> >> this more informative.  In particular, we have the (approx) sizes of each
> >> >> PG(*) and can calculate their mapping after the proposed change, which
> >> >> means we could show the min/max utilization, standard deviation, and/or
> >> >> number of nearfull or full OSDs before and after.
> >> >>
> >> >
> >> > I hadn't thought of that, but it would be cool.
> >> > My main use of the dry-run functionality is to ensure that it's not
> >> > going to change too many OSDs at once (IOW letting me try different
> >> > oload and pool values). Maybe some users want to make a large change
> >> > all in one go -- in that case this would be useful.
> >> >
> >> >
> >> >> * Almost... we don't really know how many bytes of key/value omap data are
> >> >> consumed.  So we could either go by the user data accounting, which is a
> >> >> lower bound, or average the OSD utilization by the PGs it stores
> >> >> (averaging pools together), or try to do the same for just the difference
> >> >> (which would presumably be omap data + overall overhead).
> >> >>
> >> >> I'm not sure how much it is worth trying to be accurate here...
> >> >>
> >> >>> 2. Add a configurable to limit the number of OSDs changed per execution:
> >> >>> mon_reweight_max_osds_changed (default 4)
> >> >>>
> >> >>> 3. Add a configurable to limit the weight changed per OSD:
> >> >>> mon_reweight_max_weight_change (default 0.05)
> >> >>>
> >> >>> Along with (2) and (3), the main loop in reweight_by_utilization:
> >> >>> https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L568
> >> >>> needs to sort the OSDs by utilization.
> >> >>>
> >> >>> 4. Make adjusting weights up optional with a new CLI option
> >> >>> --adjust-up. This is useful because if you have nearly full OSDs you
> >> >>> want to prioritize making space on those OSDs.
> >> >>
> >> >> These sound reasonable to me.  Although, in general, if we ultimately want
> >> >> people to do this regularly via cron or something, we'll need --adjust-up.
> >> >> I wonder if there is some other way it should be biased so that we weight
> >> >> the overfull stuff down before weighting the underfull stuff up.  Maybe
> >> >> the max_osds_changed already mostly does that by doing the fullest osds
> >> >> first?
> >> >
> >> > One other reason we need to be able to disable --adjust-up is for
> >> > non-flat crush trees, where some OSDs get more PGs because of a
> >> > non-trivial ruleset. reweight-by-pg helps in this case, but it's still
> >> > not perfect for all crush layouts.
> >> > For example, we have a ruleset which
> >> > puts two replicas in one root and a third replica in another root...
> >> > this is difficult for the reweight-by-* function to get right. One
> >> > solution to this which I haven't yet tried is to add a --bucket option
> >> > to reweight-by-* which only looks at OSDs under a given point in the
> >> > tree.
> >> >
> >> > -- Dan
> >> >
> >> >
> >> >> Thanks, Dan!
> >> >> sage
> >> >>
> >> >>
> >> >>> I have already been running with these options in a python prototype:
> >> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
> >> >>>
> >> >>> If you agree I'll port these changes to OSDMonitor.cc and send a PR.
> >> >>>
> >> >>> Best Regards,
> >> >>> Dan
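For reference, a minimal sketch of how the two limits proposed above
(mon_reweight_max_osds_changed and mon_reweight_max_weight_change) might
bound the reweight loop, with the fullest OSDs handled first.  The option
names are the proposed ones and the loop is illustrative, not the actual
OSDMonitor.cc implementation:

  #include <algorithm>
  #include <utility>
  #include <vector>

  struct osd_util_t {
    int id;
    double util;    // current utilization, 0.0 - 1.0
    double weight;  // current reweight value, 0.0 - 1.0
  };

  // Proposed knobs, with the defaults suggested in the thread.
  static const int    mon_reweight_max_osds_changed  = 4;
  static const double mon_reweight_max_weight_change = 0.05;

  // Return (osd id, new weight) pairs.  OSDs are sorted fullest-first so
  // the per-run cap is spent on the worst offenders; each step is capped
  // at mon_reweight_max_weight_change, and only overloaded OSDs are
  // touched (i.e. the no-adjust-up / --no-increasing case).
  std::vector<std::pair<int, double>>
  reweight_sketch(std::vector<osd_util_t> osds,
                  double average_util, double overload_util)
  {
    std::sort(osds.begin(), osds.end(),
              [](const osd_util_t& a, const osd_util_t& b) {
                return a.util > b.util;
              });

    std::vector<std::pair<int, double>> changes;
    for (const auto& o : osds) {
      if ((int)changes.size() >= mon_reweight_max_osds_changed)
        break;
      if (o.util <= overload_util)
        continue;
      // pull the weight toward the average, but never by more than the
      // configured per-OSD step
      double new_weight = o.weight * (average_util / o.util);
      new_weight = std::max(new_weight,
                            o.weight - mon_reweight_max_weight_change);
      changes.emplace_back(o.id, new_weight);
    }
    return changes;
  }

With the dry-run gate from item 1, such proposed changes would just be
printed rather than applied.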