Re: throttling reweight-by-utilization

Sage Weil <sage@xxxxxxxxxxxx> · Mon, 18 Jan 2016 09:06:19 -0500 (EST)

On Mon, 18 Jan 2016, Dan van der Ster wrote:
> Hi,
> 
> I'd like to propose a few changes to reweight-by-utilization which
> will make it significantly less scary:
> 
> 1. Change reweight-by-utilization to run in "dry run" -- display only
> -- mode unless an admin runs with  --yes-i-really-really-mean-it. This
> way admins can see what will be reweighted before committing to any
> changes.

I think this piece is key, and there is a lot we might do here to make 
this more informative.  In particular, we have the (approx) sizes of each 
PG(*) and can calculate their mapping after the proposed change, which 
means we could show the min/max utilization, standard deviation, and/or 
number of nearfull or full OSDs before and after.

* Almost... we don't really know how many bytes of key/value omap data are 
consumed.  So we could either go by the user data accounting, which is a 
lower bound, or average the OSD utilization over the PGs it stores 
(averaging pools together), or try to do the same for just the difference 
(which would presumably be omap data + overall overhead).

I'm not sure how much it is worth trying to be accurate here...
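
For illustration only, the before/after summary described above could be
sketched roughly as follows.  This is not code from OSDMonitor.cc: the
function name, the input dictionaries, and the 0.85/0.95 thresholds are
assumptions made for the example, and PG sizes are taken from user data
accounting, so the numbers are a lower bound (omap data is ignored).

```python
from statistics import pstdev


def summarize_utilization(pg_bytes, pg_to_osds, osd_capacity,
                          nearfull_ratio=0.85, full_ratio=0.95):
    """Summarize per-OSD utilization for one PG-to-OSD mapping.

    pg_bytes:     {pg_id: approximate bytes of user data in the PG}
    pg_to_osds:   {pg_id: [osd_id, ...]} (current or proposed mapping)
    osd_capacity: {osd_id: capacity in bytes}
    The ratio defaults are assumptions for this sketch.
    """
    used = {osd: 0 for osd in osd_capacity}
    for pg, osds in pg_to_osds.items():
        for osd in osds:
            # Every replica stores a full copy of the PG.
            used[osd] += pg_bytes[pg]

    util = [used[osd] / osd_capacity[osd] for osd in osd_capacity]
    return {
        "min": min(util),
        "max": max(util),
        "stddev": pstdev(util),  # population standard deviation
        # "nearfull" counts every OSD at or above the nearfull ratio,
        # including those that are also full.
        "nearfull": sum(1 for u in util if u >= nearfull_ratio),
        "full": sum(1 for u in util if u >= full_ratio),
    }
```

Calling this once with the current mapping and once with the mapping
computed for the proposed weights would give the two columns of a
dry-run report.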

> 2. Add a configurable to limit the number of OSDs changed per execution:
>   mon_reweight_max_osds_changed (default 4)
> 
> 3. Add a configurable to limit the weight changed per OSD:
>   mon_reweight_max_weight_change (default 0.05)
> 
> Along with (2) and (3), the main loop in reweight_by_utilization:
>   https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L568
> needs to sort the OSDs by utilization.
> 
> 4. Make adjusting weights up optional with a new CLI option
> --adjust-up. This is useful because if you have nearly full OSDs you
> want to prioritize making space on those OSDs.

These sound reasonable to me.  Although, in general, if we ultimately want 
people to do this regularly via cron or something, we'll need --adjust-up.  
I wonder if there is some other way it should be biased so that we weight 
the overfull stuff down before weighting the underfull stuff up.  Maybe 
the max_osds_changed already mostly does that by doing the fullest osds 
first?
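
To make the interaction of (2), (3) and (4) concrete, here is a minimal
sketch of a throttled planning loop.  It is not the OSDMonitor.cc
implementation or Dan's prototype: the function name, the
overload_factor parameter, and the symmetric underload threshold are
assumptions for the example.  The two limits use the defaults proposed
in the quoted mail.

```python
def plan_reweight(utilization, weights, max_osds_changed=4,
                  max_weight_change=0.05, overload_factor=1.2,
                  adjust_up=False):
    """Return {osd_id: (old_weight, new_weight)} without applying anything.

    utilization: {osd_id: fraction of capacity used}
    weights:     {osd_id: current reweight value in [0, 1]}
    overload_factor is a hypothetical knob for this sketch.
    """
    if not utilization:
        return {}
    avg = sum(utilization.values()) / len(utilization)
    overload = avg * overload_factor
    underload = avg - (overload - avg)  # mirror of the overload threshold

    plan = {}
    # Visit the fullest OSDs first, so the per-run limit is spent on
    # overfull OSDs before any underfull OSD is considered.
    for osd in sorted(utilization, key=utilization.get, reverse=True):
        if len(plan) >= max_osds_changed:
            break
        util, old = utilization[osd], weights[osd]
        if util > overload:
            # Scale toward the average, but move at most max_weight_change.
            new = max(old * avg / util, old - max_weight_change)
        elif adjust_up and util < underload:
            target = old * avg / util if util > 0 else 1.0
            new = min(target, old + max_weight_change, 1.0)
        else:
            continue
        if new != old:
            plan[osd] = (old, new)
    return plan
```

Because the function only returns a plan, printing the result is the
dry run; applying it would be a separate, explicitly confirmed step.
With this ordering, underfull OSDs are only weighted up when fewer than
max_osds_changed OSDs are overfull.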

Thanks, Dan!
sage

> I have already been running with these options in a python prototype:
>   https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
> 
> If you agree I'll port these changes to OSDMonitor.cc and send a PR.
> 
> Best Regards,
> Dan
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html