Re: mgr balancer module

On Sat, 29 Jul 2017, Spandan Kumar Sahu wrote:
> On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@xxxxxxxxxx> wrote:
> > Hi all,
> >
> > I've been working off and on on a mgr module 'balancer' that will
> > automatically optimize the pg distribution.  The idea is you'll
> > eventually be able to just turn it on and it will slowly and continuously
> > optimize the layout without having to think about it.
> >
> > I got something basic implemented pretty quickly that wraps around the new
> > pg-upmap optimizer embedded in OSDMap.cc and osdmaptool.  And I had
> > something that adjusts the compat weight-set (optimizing crush weights in a
> > backward-compatible way) that sort of kind of worked, but its problem was
> > that it worked against the actual cluster instead of a model of the
> > cluster, which meant it didn't always know whether a change it was making
> > was going to be a good one until it tried it (and moved a bunch of data
> > around). The conclusion from that was that the optimizer, regardless of what
> > method it was using (upmap, crush weights, osd weights) had to operate
> > against a model of the system so that it could check whether its changes
> > were good ones before making them.
> >
> > I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed
> > to mgr modules in python-land to allow this.  Modules can get a handle for
> > the current osdmap, create an incremental and propose changes to it (osd
> > weights, upmap entries, crush weights), and apply it to get a new test
> > osdmap.  And I have a preliminary eval function that will analyze the
> > distribution for a map (original or proposed) so that they can be
> > compared.  In order to make sense of this and test it I made up a simple
> > interface to interact with it, but I want to run it by people to make sure
> > it makes sense.
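
For concreteness, here is a minimal sketch of the propose-and-evaluate loop
described above.  Every name in it (get_osdmap, new_incremental,
set_osd_reweight, apply_incremental, evaluate) is a placeholder for whatever
the Python bindings end up exposing, not the actual API:

    # Hypothetical sketch only; all method names are placeholders.
    def try_reweight(module, osd_id, new_weight):
        """Propose a single weight change and keep it only if it helps."""
        osdmap = module.get_osdmap()                # handle for the current map
        before = module.evaluate(osdmap)            # score the current distribution

        inc = osdmap.new_incremental()              # start a proposed change set
        inc.set_osd_reweight(osd_id, new_weight)    # propose an osd weight change
        test_map = osdmap.apply_incremental(inc)    # a test map, not the live cluster

        after = module.evaluate(test_map)           # score the proposed distribution
        return inc if after < before else None      # lower score == better balance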
> >
> > The basics:
> >
> >   ceph balancer mode <none,upmap,crush-compat,...>
> >         - which optimization method to use
> 
> Regarding the implementation of 'do_osd_weight', can we move the
> existing 'reweight-by-utilization' and 'reweight-by-pg' to MonCommands
> from MgrCommands? And then, we can simply send a command to "mon"? Or
> is there a way to call something like "send_command(result, 'mgr',
> ''...)" ?

Yes and no... I think the inner loop doing the arithmetic can be copied, 
but part of what I've done so far in balancer has built most (I think) of 
the surrounding infrastructure so that we are reweighting the right osds 
to match the right target distribution.  The current reweight-by-* doesn't 
understand multiple crush rules/roots (which are easy to create now with 
the new device classes).  It should be pretty easy to slot it in now...
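
Roughly the shape of that per-root bookkeeping, as a plain-Python sketch (the
input dictionaries are assumed shapes, not what the balancer actually passes
around):

    # Sketch only: compare each osd's pg count against its weight-proportional
    # target, separately per crush root, so the reweight arithmetic is applied
    # to the right osds even with multiple rules/roots.
    def deviation_by_root(pgs_per_osd, crush_weights):
        """pgs_per_osd and crush_weights: {root: {osd_id: value}}."""
        result = {}
        for root, weights in crush_weights.items():
            total_weight = sum(weights.values())
            total_pgs = sum(pgs_per_osd.get(root, {}).values())
            if total_weight <= 0:
                continue
            result[root] = {
                osd: pgs_per_osd.get(root, {}).get(osd, 0) - total_pgs * w / total_weight
                for osd, w in weights.items()
            }
        return result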

sage

> 
> >   ceph balancer on
> >         - run automagically
> >   ceph balancer off
> >         - stop running automagically
> >   ceph balancer status
> >         - see current mode, any plans, whether it's enabled
> >
> > The useful bits:
> >
> >   ceph balancer eval
> >         - show analysis of current data distribution
> >   ceph balancer optimize <plan>
> >         - create a new optimization plan named <plan> based on the current
> >           mode
> >         - ceph balancer status will include a list of plans in memory
> >           (these currently go away if ceph-mgr daemon restarts)
> >   ceph balancer eval <plan>
> >         - analyze the resulting distribution if the plan is executed
> >   ceph balancer show <plan>
> >         - show what the plan would do (basically a dump of cli commands to
> >           adjust weights etc)
> >   ceph balancer execute <plan>
> >         - execute plan (and then discard it)
> >   ceph balancer rm <plan>
> >         - discard plan
> >
> > A normal user will be expected to just set the mode and turn it on:
> >
> >   ceph balancer mode crush-compat
> >   ceph balancer on
> >
> > An advanced user can play with different optimizer modes etc and see what
> > they will actually do before making any changes to their cluster.
> >
> > Does this seem like a reasonable direction for an operator interface?
> >
> > --
> >
> > The other part of this exercise is to set up the infrastructure to do the
> > optimization "right".  All of the current code floating around to reweight
> > by utilization etc is deficient when you do any non-trivial CRUSH things.
> > I'm trying to get the infrastructure in place from the get-go so that this
> > will work with multiple roots and device classes.
> >
> > There will be some restrictions depending on the mode.  Notably, the
> > crush-compat mode only has a single set of weights to adjust, so it can't do
> > much if there are multiple hierarchies being balanced that overlap over
> > any of the same devices (we should make the balancer refuse to continue in
> > that case).
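
One way that refusal check could look, assuming the balancer can enumerate the
devices under each root it is asked to balance (the helper below is
hypothetical):

    # Hypothetical check for crush-compat mode: with only one compat weight-set
    # to adjust, refuse to continue if two balanced roots share any devices.
    from itertools import combinations

    def overlapping_roots(devices_by_root):
        """devices_by_root: {root_name: set of osd ids under that root}."""
        for (a, osds_a), (b, osds_b) in combinations(devices_by_root.items(), 2):
            if osds_a & osds_b:
                return (a, b)    # overlapping hierarchies; balancer should refuse
        return None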
> >
> > Similarly, we can't project what utilization will look like
> > with a proposed change when balancing based on actual osd utilization
> > (what each osd reports as its total usage).  Instead, we need to model the
> > size of each pg so that we can tell how things change when we move pgs.
> > Initially this will use the pg stats, but that is an incomplete solution
> > because we don't properly account for omap data.  There is also some
> > storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps,
> > per-pg metadata).  I think eventually we'll probably want to build a model
> > around pg size based on what the stats say, what the osds report, and a
> > model for unknown variables (omap cost per pg, per-object overhead, etc).
> > Until then, we can just make do with the pg stats (should work reasonably
> > well as long as you're not mixing omap and non-omap pools on the same
> > devices but via different subtrees).
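
A rough sketch of that projection using pg stats alone, with omap data and
per-pg overhead ignored as noted (the input shapes are assumed):

    # Sketch only: project per-osd utilization for a proposed map from per-pg
    # sizes reported in pg stats; omap and per-pg overhead are not modeled yet.
    def projected_utilization(pg_bytes, pg_to_osds, osd_capacity):
        """pg_bytes: {pgid: bytes}; pg_to_osds: {pgid: [osd ids]} in the test map."""
        used = {osd: 0 for osd in osd_capacity}
        for pgid, osds in pg_to_osds.items():
            for osd in osds:
                used[osd] += pg_bytes.get(pgid, 0)   # each replica holds a full copy
        # assumes non-zero capacities (bytes) for every osd
        return {osd: used[osd] / osd_capacity[osd] for osd in osd_capacity}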
> >
> > sage
> 
> 
> 
> -- 
> Spandan Kumar Sahu
> IIT Kharagpur
> 
> 