On Wed, Mar 2, 2016 at 8:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Latest branch:
>
>         https://github.com/liewegas/ceph/commits/wip-reweight
>
> I made several changes:
>
> - new command, 'osd utilization'
>
> avg 51
> stddev 0.707107 (expected baseline 6.68019)
> min osd.3 with 52 pgs (1.01961 * mean)
> max osd.3 with 50 pgs (0.980392 * mean)
>
> (json version too)
>
> - new commands: osd test-reweight-by-{pg,utilization}
>
> same as osd reweight-by-*, but it doesn't actually do it--just shows the
> before and after stats:
>
> no change
> moved 2 / 408 (0.490196%)
> avg 51
> stddev 5.26783 -> 4.74342 (expected baseline 6.68019)
> min osd.3 with 59 -> 57 pgs (1.15686 -> 1.11765 * mean)
> max osd.3 with 41 -> 42 pgs (0.803922 -> 0.823529 * mean)
>
> oload 105
> max_change 0.05
> average 51.000000
> overload 53.550000
> osd.3 weight 1.000000 -> 0.950012
> osd.2 weight 1.000000 -> 0.950012

Those look good to me.

> I'm mostly happy with it, but there are a couple annoying problems:
>
> 1) Adding the --no-increasing option (which prevents us from
> weighting things up) effectively breaks the CLI parsing for the optional
> pool arguments for reweight-by-pg. I'm not sure which is more useful; I
> don't think I'd use either one.

The use case for --no-increasing is to have a way to immediately free
space from the fullest OSDs... useful in a crisis situation, I suppose.
Though, in my experience there are not often very many underweighted
PGs... and maybe this option would cause a sort of race to 0.0 over time.

An alternative would be to be less aggressive about increasing weights:

-  // but aggressively adjust weights up whenever possible.
-  double underload_util = average_util;
+  // adjust up only if we are below the threshold
+  double underload_util = average_util - (overload_util - average_util);

Maybe even add a configurable osd_reweight_underload_factor on the
(overload_util - average_util) term to make that optional.
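To make that concrete, here is a rough, untested sketch of the shape I have
in mind -- the helper names are made up and this is not actual OSDMonitor
code; underload_factor stands in for the hypothetical
osd_reweight_underload_factor option, and the other names just mirror the
locals in the diff above:

// Illustrative only: derive an underload threshold from the overload band.
// underload_factor = 0.0 keeps today's behaviour (reweight up anything
// below average); 1.0 mirrors the overload band on the low side.
double underload_threshold(double average_util, double overload_util,
                           double underload_factor)
{
  return average_util - underload_factor * (overload_util - average_util);
}

// Only raise a weight when the OSD is clearly below the band, not merely
// a hair under the average.
bool should_reweight_up(double util, double average_util,
                        double overload_util, double underload_factor)
{
  return util < underload_threshold(average_util, overload_util,
                                    underload_factor);
}

With the numbers from the test output above (average 51, overload 53.55), a
factor of 1.0 would only reweight up OSDs sitting below 48.45.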
This is all fine-tuning... not sure how essential it is.

On the other hand, pool args on reweight-by-pg have two use-cases in my
experience:

I usually use pool args because I know that only a couple (out of say 15)
pools have all the data -- it's not worth it (and possibly ineffective) to
balance the empty PGs around.

The other (and probably more important) use-case for pool args is when you
have a non-uniform crush tree -- e.g. some part of the tree should by design
get more PGs than other parts of the tree. This second use-case would be
better served by an option to reweight-by-pg to only reweight OSDs beneath
a given crush bucket. But I thought that might be a challenge to implement,
and anyway I'm not sure whether replacing a pool arg with a bucket arg
solves the CLI parsing problem. If it is somehow doable, though, then we
could probably do away with the pool args.
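To sketch the --bucket idea (illustrative only -- none of these helpers
exist today, and walking the crush map to collect the OSD ids beneath the
chosen bucket is left out):

#include <map>
#include <set>

// Keep only the OSDs beneath the bucket given on the command line, so that
// reweight-by-* never compares OSDs across differently-weighted subtrees.
std::map<int, double>
filter_by_bucket(const std::map<int, double>& util_by_osd,  // osd id -> utilization
                 const std::set<int>& osds_under_bucket)
{
  std::map<int, double> filtered;
  for (const auto& p : util_by_osd) {
    if (osds_under_bucket.count(p.first))
      filtered.insert(p);
  }
  return filtered;
}

// The existing average/overload calculation would then run unchanged on the
// filtered map instead of on the full set of OSDs.

Each subtree would get balanced against its own average, which is what the
non-uniform tree case needs.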
> 2) The stats are all based on pg counts. It might be possible to estimate
> new stats using the pgmap and estimating storage overhead and a bunch
> of other ugly hackery so that the reweight-by-utilization case would show
> stats in terms of bytes, but it'd be a lot of work, and I don't think it's
> worth it. That means that even though the reweight-by-utilization
> adjustments are done based on actual osd utilizations, the
> before/after stats it shows are in terms of pgs.

I think PG counts are OK.

> Thoughts?

Thanks!
dan

> sage
>
> On Tue, 1 Mar 2016, Dan van der Ster wrote:
>> Hi Sage,
>> I have a wip here: https://github.com/cernceph/ceph/commits/wip-reweight
>> I still need to spin up a small test cluster to test it.
>> -- Dan
>>
>> On Mon, Jan 18, 2016 at 4:25 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> > On Mon, Jan 18, 2016 at 3:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >> On Mon, 18 Jan 2016, Dan van der Ster wrote:
>> >>> Hi,
>> >>>
>> >>> I'd like to propose a few changes to reweight-by-utilization which
>> >>> will make it significantly less scary:
>> >>>
>> >>> 1. Change reweight-by-utilization to run in "dry run" -- display only
>> >>> -- mode unless an admin runs with --yes-i-really-really-mean-it. This
>> >>> way admins can see what will be reweighted before committing to any
>> >>> changes.
>> >>
>> >> I think this piece is key, and there is a lot we might do here to make
>> >> this more informative. In particular, we have the (approx) sizes of each
>> >> PG(*) and can calculate their mapping after the proposed change, which
>> >> means we could show the min/max utilization, standard deviation, and/or
>> >> number of nearfull or full OSDs before and after.
>> >
>> > I hadn't thought of that, but it would be cool.
>> > My main use of the dry-run functionality is to ensure that it's not
>> > going to change too many OSDs at once (IOW letting me try different
>> > oload and pool values). Maybe some users want to make a large change
>> > all in one go -- in that case this would be useful.
>> >
>> >> * Almost... we don't really know how many bytes of key/value omap data are
>> >> consumed. So we could either go by the user data accounting, which is a
>> >> lower bound, or average the OSD utilization by the PGs it stores
>> >> (averaging pools together), or try to do the same for just the difference
>> >> (which would presumably be omap data + overall overhead).
>> >>
>> >> I'm not sure how much it is worth trying to be accurate here...
>> >>
>> >>> 2. Add a configurable to limit the number of OSDs changed per execution:
>> >>> mon_reweight_max_osds_changed (default 4)
>> >>>
>> >>> 3. Add a configurable to limit the weight changed per OSD:
>> >>> mon_reweight_max_weight_change (default 0.05)
>> >>>
>> >>> Along with (2) and (3), the main loop in reweight_by_utilization:
>> >>> https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L568
>> >>> needs to sort the OSDs by utilization.
>> >>>
>> >>> 4. Make adjusting weights up optional with a new CLI option
>> >>> --adjust-up. This is useful because if you have nearly full OSDs you
>> >>> want to prioritize making space on those OSDs.
>> >>
>> >> These sound reasonable to me. Although, in general, if we ultimately want
>> >> people to do this regularly via cron or something, we'll need --adjust-up.
>> >> I wonder if there is some other way it should be biased so that we weight
>> >> the overfull stuff down before weighting the underfull stuff up. Maybe
>> >> the max_osds_changed already mostly does that by doing the fullest osds
>> >> first?
>> >
>> > One other reason we need to be able to disable --adjust-up is for
>> > non-flat crush trees, where some OSDs get more PGs because of a
>> > non-trivial ruleset. reweight-by-pg helps in this case, but it's still not
>> > perfect for all crush layouts. For example, we have a ruleset which
>> > puts two replicas in one root and a third replica in another root...
>> > this is difficult for the reweight-by-* function to get right. One
>> > solution to this which I haven't yet tried is to add a --bucket option
>> > to reweight-by-* which only looks at OSDs under a given point in the
>> > tree.
>> >
>> > -- Dan
>> >
>> >> Thanks, Dan!
>> >> sage
>> >>
>> >>> I have already been running with these options in a python prototype:
>> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
>> >>>
>> >>> If you agree I'll port these changes to OSDMonitor.cc and send a PR.
>> >>>
>> >>> Best Regards,
>> >>> Dan