On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@xxxxxxxxxx> wrote:
> Hi all,
>
> I've been working off and on on a mgr module 'balancer' that will do
> automatic optimization of the pg distribution.  The idea is you'll
> eventually be able to just turn it on and it will slowly and continuously
> optimize the layout without having to think about it.
>
> I got something basic implemented pretty quickly that wraps around the new
> pg-upmap optimizer embedded in OSDMap.cc and osdmaptool.  And I had
> something that adjusts the compat weight-set (optimizing crush weights in
> a backward-compatible way) that sort of worked, but its problem was that
> it worked against the actual cluster instead of a model of the cluster,
> which meant it didn't always know whether a change it was making was going
> to be a good one until it tried it (and moved a bunch of data around).
> The conclusion from that was that the optimizer, regardless of what method
> it was using (upmap, crush weights, osd weights), had to operate against a
> model of the system so that it could check whether its changes were good
> ones before making them.
>
> I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed
> to mgr modules in python-land to allow this.  Modules can get a handle for
> the current osdmap, create an incremental and propose changes to it (osd
> weights, upmap entries, crush weights), and apply it to get a new test
> osdmap.  And I have a preliminary eval function that will analyze the
> distribution for a map (original or proposed) so that they can be
> compared.  In order to make sense of this and test it I made up a simple
> interface to interact with it, but I want to run it by people to make sure
> it makes sense.
>
> The basics:
>
> ceph balancer mode <none,upmap,crush-compat,...>
>   - which optimization method to use

Regarding the implementation of 'do_osd_weight', can we move the existing
'reweight-by-utilization' and 'reweight-by-pg' to MonCommands from
MgrCommands?  Then we could simply send a command to "mon".  Or is there a
way to call something like "send_command(result, 'mgr', ''...)"?  (A rough
sketch of what I mean is below, after the quoted text.)

> ceph balancer on
>   - run automagically
> ceph balancer off
>   - stop running automagically
> ceph balancer status
>   - see current mode, any plans, whether it's enabled
>
> The useful bits:
>
> ceph balancer eval
>   - show analysis of the current data distribution
> ceph balancer optimize <plan>
>   - create a new optimization plan named <plan> based on the current
>     mode
>   - ceph balancer status will include a list of plans in memory
>     (these currently go away if the ceph-mgr daemon restarts)
> ceph balancer eval <plan>
>   - analyze the resulting distribution if the plan is executed
> ceph balancer show <plan>
>   - show what the plan would do (basically a dump of cli commands to
>     adjust weights etc)
> ceph balancer execute <plan>
>   - execute the plan (and then discard it)
> ceph balancer rm <plan>
>   - discard the plan
>
> A normal user will be expected to just set the mode and turn it on:
>
>   ceph balancer mode crush-compat
>   ceph balancer on
>
> An advanced user can play with different optimizer modes etc and see what
> they will actually do before making any changes to their cluster.
>
> Does this seem like a reasonable direction for an operator interface?
>
> --
>
> The other part of this exercise is to set up the infrastructure to do the
> optimization "right".  All of the current code floating around to reweight
> by utilization etc is deficient when you do any non-trivial CRUSH things.
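To make the second half of my question above concrete, here is roughly the
kind of call I was picturing from inside a mgr module.  This is only a
sketch: I'm assuming the CommandResult/send_command helpers in mgr_module
behave the way the existing modules use them, the class and method names
are just illustrative, and the 'osd reweight-by-utilization' prefix would
only be accepted if that command lives on (or is moved to) the mon side,
as asked above.

    import json

    from mgr_module import MgrModule, CommandResult


    class Module(MgrModule):
        def do_osd_weight(self):
            # Ask the mon to run reweight-by-utilization rather than
            # reimplementing that logic inside the module.
            result = CommandResult('')
            self.send_command(result, 'mon', '', json.dumps({
                'prefix': 'osd reweight-by-utilization',
                'format': 'json',
            }), '')
            r, outb, outs = result.wait()
            if r != 0:
                self.log.error('reweight-by-utilization failed: %s', outs)
            return r

If send_command can only target mon/osd services and not another mgr
module, then moving the two commands into MonCommands seems like the
simpler route.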
> I'm trying to get the infrastructure in place from the get-go so that this
> will work with multiple roots and device classes.
>
> There will be some restrictions depending on the mode.  Notably,
> crush-compat only has a single set of weights to adjust, so it can't do
> much if there are multiple hierarchies being balanced that overlap over
> any of the same devices (we should make the balancer refuse to continue in
> that case).
>
> Similarly, we can't do projections of what utilization will look like
> with a proposed change when balancing based on actual osd utilization
> (what each osd reports as its total usage).  Instead, we need to model the
> size of each pg so that we can tell how things change when we move pgs.
> Initially this will use the pg stats, but that is an incomplete solution
> because we don't properly account for omap data.  There is also some
> storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps,
> per-pg metadata).  I think eventually we'll probably want to build a model
> around pg size based on what the stats say, what the osds report, and a
> model for unknown variables (omap cost per pg, per-object overhead, etc).
> Until then, we can just make do with the pg stats (should work reasonably
> well as long as you're not mixing omap and non-omap pools on the same
> devices but via different subtrees).
>
> sage

--
Spandan Kumar Sahu
IIT Kharagpur
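P.S. On the pg-size modelling paragraph above: a toy version of the
projection I had in mind looks something like the snippet below.  The
dictionaries and the helper are purely my own illustration (none of this
exists in the tree), and a real model would still need the omap and
per-pg overhead terms Sage mentions.

    def projected_utilization(pg_bytes, pg_up_map, osd_capacity, moves):
        """Model per-OSD utilization before/after a set of pg moves.

        pg_bytes:     {pgid: bytes used, e.g. taken from pg stats}
        pg_up_map:    {pgid: [osd ids the pg currently maps to]}
        osd_capacity: {osd id: total bytes}
        moves:        {pgid: (src osd, dst osd)} proposed by the optimizer
        """
        used = {osd: 0 for osd in osd_capacity}
        for pgid, osds in pg_up_map.items():
            for osd in osds:
                used[osd] += pg_bytes.get(pgid, 0)
        # Apply the proposed moves to the model, not to the cluster, so a
        # bad plan can be thrown away without moving any data.
        for pgid, (src, dst) in moves.items():
            used[src] -= pg_bytes.get(pgid, 0)
            used[dst] += pg_bytes.get(pgid, 0)
        return {osd: used[osd] / float(osd_capacity[osd]) for osd in used}

Comparing the spread (max/min or standard deviation) of the returned
utilizations for the current map versus a proposed plan would be the
model-side equivalent of 'ceph balancer eval <plan>'.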