Hi Sage,

I have a WIP here: https://github.com/cernceph/ceph/commits/wip-reweight
I still need to spin up a small test cluster to test it.

-- Dan

On Mon, Jan 18, 2016 at 4:25 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> On Mon, Jan 18, 2016 at 3:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Mon, 18 Jan 2016, Dan van der Ster wrote:
>>> Hi,
>>>
>>> I'd like to propose a few changes to reweight-by-utilization which
>>> will make it significantly less scary:
>>>
>>> 1. Change reweight-by-utilization to run in "dry run" -- display only
>>> -- mode unless an admin runs with --yes-i-really-really-mean-it. This
>>> way admins can see what will be reweighted before committing to any
>>> changes.
>>
>> I think this piece is key, and there is a lot we might do here to make
>> this more informative. In particular, we have the (approx) sizes of each
>> PG(*) and can calculate their mapping after the proposed change, which
>> means we could show the min/max utilization, standard deviation, and/or
>> number of nearfull or full OSDs before and after.
>>
>
> I hadn't thought of that, but it would be cool.
> My main use of the dry-run functionality is to ensure that it's not
> going to change too many OSDs at once (IOW letting me try different
> oload and pool values). Maybe some users want to make a large change
> all in one go -- in that case this would be useful.
>
>
>> * Almost... we don't really know how many bytes of key/value omap data are
>> consumed. So we could either go by the user data accounting, which is a
>> lower bound, or average the OSD utilization by the PGs it stores
>> (averaging pools together), or try to do the same for just the difference
>> (which would presumably be omap data + overall overhead).
>>
>> I'm not sure how much it is worth trying to be accurate here...
>>
>>> 2. Add a configurable to limit the number of OSDs changed per execution:
>>> mon_reweight_max_osds_changed (default 4)
>>>
>>> 3. Add a configurable to limit the weight changed per OSD:
>>> mon_reweight_max_weight_change (default 0.05)
>>>
>>> Along with (2) and (3), the main loop in reweight_by_utilization:
>>> https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L568
>>> needs to sort the OSDs by utilization.
>>>
>>> 4. Make adjusting weights up optional with a new CLI option
>>> --adjust-up. This is useful because if you have nearly full OSDs you
>>> want to prioritize making space on those OSDs.
>>
>> These sound reasonable to me. Although, in general, if we ultimately want
>> people to do this regularly via cron or something, we'll need --adjust-up.
>> I wonder if there is some other way it should be biased so that we weight
>> the overfull stuff down before weighting the underfull stuff up. Maybe
>> the max_osds_changed already mostly does that by doing the fullest OSDs
>> first?
>
> One other reason we need to be able to disable --adjust-up is for
> non-flat crush trees, where some OSDs get more PGs because of a
> non-trivial ruleset. reweight-by-pg helps in this case, but it's still
> not perfect for all crush layouts. For example, we have a ruleset which
> puts two replicas in one root and a third replica in another root...
> this is difficult for the reweight-by-* functions to get right. One
> solution to this, which I haven't yet tried, is to add a --bucket option
> to reweight-by-* which only looks at OSDs under a given point in the
> tree.
>
> -- Dan
>
>
>> Thanks, Dan!
>> sage
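To make the proposed loop concrete, here is the selection logic from my
prototype boiled down to a self-contained Python sketch. The data
structures and threshold math are simplified stand-ins for illustration,
not the real OSDMonitor code:

MAX_OSDS_CHANGED = 4      # mon_reweight_max_osds_changed
MAX_WEIGHT_CHANGE = 0.05  # mon_reweight_max_weight_change

def propose_reweights(utils, weights, oload=1.2, adjust_up=False):
    """Return {osd: proposed weight}, touching at most MAX_OSDS_CHANGED OSDs.

    utils   -- {osd: used fraction, e.g. 0.85}
    weights -- {osd: current reweight, 1.0 == full weight}
    oload   -- only OSDs above oload * average utilization are weighted down
    """
    avg = sum(utils.values()) / len(utils)
    changes = {}

    # Overfull OSDs first, fullest at the front, so the per-run cap is
    # spent where it matters most -- this also gives the bias Sage asked
    # about: down-weights naturally win the budget.
    for osd, util in sorted(utils.items(), key=lambda x: -x[1]):
        if len(changes) >= MAX_OSDS_CHANGED:
            return changes
        if util > avg * oload:
            step = min(MAX_WEIGHT_CHANGE, weights[osd] * (util - avg) / util)
            changes[osd] = round(max(0.0, weights[osd] - step), 4)

    if adjust_up:
        # Then underfull OSDs, least utilized first, with whatever budget
        # remains. Only OSDs previously reweighted below 1.0 can go up.
        for osd, util in sorted(utils.items(), key=lambda x: x[1]):
            if len(changes) >= MAX_OSDS_CHANGED:
                break
            if util < avg / oload and weights[osd] < 1.0:
                changes[osd] = round(min(1.0, weights[osd] + MAX_WEIGHT_CHANGE), 4)

    return changes

if __name__ == '__main__':
    utils = {0: 0.91, 1: 0.55, 2: 0.62, 3: 0.88, 4: 0.60}
    weights = dict((osd, 1.0) for osd in utils)
    print(propose_reweights(utils, weights, oload=1.2, adjust_up=True))
    # -> {0: 0.95, 3: 0.95}: only the overfull OSDs move, each by at
    #    most MAX_WEIGHT_CHANGE

The point is just the ordering and the two caps; the actual utilization
math in OSDMonitor.cc will differ.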
>>
>>> I have already been running with these options in a python prototype:
>>> https://github.com/cernceph/ceph-scripts/blob/master/tools/crush-reweight-by-utilization.py
>>>
>>> If you agree I'll port these changes to OSDMonitor.cc and send a PR.
>>>
>>> Best Regards,
>>> Dan
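P.S. For the before/after summary Sage suggested, I'm imagining output
along these lines. This is only a sketch: the "after" utilizations would
come from remapping the (approx) PG sizes with the candidate weights,
which the snippet simply takes as input, and the nearfull/full ratios
are hard-coded here rather than read from the monitor config:

import math

NEARFULL_RATIO = 0.85   # assumed; the real values come from mon config
FULL_RATIO = 0.95

def summarize(utils):
    """Min/max/stddev and nearfull/full counts for per-OSD used fractions."""
    avg = sum(utils) / len(utils)
    stddev = math.sqrt(sum((u - avg) ** 2 for u in utils) / len(utils))
    return (min(utils), max(utils), stddev,
            sum(1 for u in utils if u >= NEARFULL_RATIO),
            sum(1 for u in utils if u >= FULL_RATIO))

def dry_run_report(before, after):
    print('        min   max   stddev  nearfull  full')
    for label, utils in (('before', before), ('after', after)):
        print('%-7s %.2f  %.2f  %.3f   %-8d  %d'
              % ((label,) + summarize(utils)))

if __name__ == '__main__':
    dry_run_report(before=[0.91, 0.55, 0.62, 0.88, 0.60],
                   after=[0.84, 0.58, 0.64, 0.82, 0.63])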