Hi all,

I've been working off and on on a mgr module 'balancer' that will do automatic optimization of the pg distribution. The idea is that you'll eventually be able to just turn it on and it will slowly and continuously optimize the layout without you having to think about it.

I got something basic implemented pretty quickly that wraps around the new pg-upmap optimizer embedded in OSDMap.cc and osdmaptool. I also had something that adjusted the compat weight-set (optimizing crush weights in a backward-compatible way) that sort of kind of worked, but its problem was that it worked against the actual cluster instead of a model of the cluster, which meant it didn't always know whether a change it was making was going to be a good one until it tried it (and moved a bunch of data around). The conclusion from that was that the optimizer, regardless of which method it is using (upmap, crush weights, osd weights), has to operate against a model of the system so that it can check whether its changes are good ones before making them.

I got enough of OSDMap, OSDMap::Incremental, and CrushWrapper exposed to mgr modules in python-land to allow this. Modules can get a handle for the current osdmap, create an incremental, propose changes to it (osd weights, upmap entries, crush weights), and apply it to get a new test osdmap. And I have a preliminary eval function that will analyze the distribution for a map (original or proposed) so that they can be compared.

In order to make sense of this and test it I made up a simple interface to interact with it, but I want to run it by people to make sure it makes sense.

The basics:

  ceph balancer mode <none,upmap,crush-compat,...>
      - which optimization method to use
  ceph balancer on
      - run automagically
  ceph balancer off
      - stop running automagically
  ceph balancer status
      - see current mode, any plans, whether it's enabled

The useful bits:

  ceph balancer eval
      - show analysis of the current data distribution
  ceph balancer optimize <plan>
      - create a new plan named <plan> to optimize based on the current mode
      - 'ceph balancer status' will include a list of plans in memory
        (these currently go away if the ceph-mgr daemon restarts)
  ceph balancer eval <plan>
      - analyze the resulting distribution if the plan is executed
  ceph balancer show <plan>
      - show what the plan would do (basically a dump of cli commands to
        adjust weights etc)
  ceph balancer execute <plan>
      - execute the plan (and then discard it)
  ceph balancer rm <plan>
      - discard the plan

A normal user will be expected to just set the mode and turn it on:

  ceph balancer mode crush-compat
  ceph balancer on

An advanced user can play with different optimizer modes etc and see what they will actually do before making any changes to their cluster.

Does this seem like a reasonable direction for an operator interface?

--

The other part of this exercise is to set up the infrastructure to do the optimization "right". All of the current code floating around to reweight by utilization etc is deficient when you do any non-trivial CRUSH things. I'm trying to get the infrastructure in place from the get-go so that this will work with multiple roots and device classes.

There will be some restrictions depending on the mode. Notably, crush-compat only has a single set of weights to adjust, so it can't do much if there are multiple hierarchies being balanced that overlap on any of the same devices (we should make the balancer refuse to continue in that case).
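The check I have in mind there is basically just a set intersection. A rough sketch in Python (the mgr module language); 'osds_by_root' is a hypothetical stand-in for whatever the CrushWrapper bindings end up exposing for enumerating the devices beneath each root, not an API that exists today:

  def overlapping_roots(osds_by_root):
      """Return pairs of CRUSH roots whose device sets intersect.

      osds_by_root: dict mapping a root name to the set of OSD ids
      reachable beneath it (how that set is obtained is up to the
      CrushWrapper bindings; this is only the comparison step).
      """
      roots = sorted(osds_by_root)
      overlaps = []
      for i, a in enumerate(roots):
          for b in roots[i + 1:]:
              if osds_by_root[a] & osds_by_root[b]:
                  overlaps.append((a, b))
      return overlaps

  # crush-compat mode would refuse to proceed if this returns anything,
  # e.g. overlapping_roots({'a': {0, 1, 2}, 'b': {2, 3}}) -> [('a', 'b')]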
Similarly, when balancing based on actual osd utilization (what each osd reports as its total usage), we can't project what utilization will look like with a proposed change. Instead, we need to model the size of each pg so that we can tell how things change when we move pgs. Initially this will use the pg stats, but that is an incomplete solution because we don't properly account for omap data. There is also some storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps, per-pg metadata). I think eventually we'll probably want to build a model of pg size based on what the stats say, what the osds report, and a model for the unknown variables (omap cost per pg, per-object overhead, etc.). Until then, we can make do with the pg stats (which should work reasonably well as long as you're not mixing omap and non-omap pools on the same devices but via different subtrees).
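To make that concrete, here is a minimal sketch of the pg-stats-based projection, assuming we have per-pg byte counts and the pg-to-OSD mapping for whichever map (the current one or a proposed test osdmap) is being evaluated. The function name and arguments are illustrative, not the actual module API:

  from collections import defaultdict

  def projected_utilization(pg_bytes, pg_to_osds, osd_size_bytes):
      """Project per-OSD utilization for a (real or proposed) mapping.

      pg_bytes:       dict pgid -> bytes stored (from pg stats)
      pg_to_osds:     dict pgid -> list of OSDs the pg maps to under
                      the map being evaluated
      osd_size_bytes: dict osd id -> raw capacity in bytes

      Returns dict osd id -> fraction full.  Omap data and per-pg/OSD
      overhead are not accounted for, per the caveats above.
      """
      used = defaultdict(int)
      for pgid, osds in pg_to_osds.items():
          for osd in osds:
              used[osd] += pg_bytes.get(pgid, 0)
      return {osd: used[osd] / float(size)
              for osd, size in osd_size_bytes.items()}

Comparing the spread of that projection for the current map against the projection for a proposed incremental is roughly what 'ceph balancer eval <plan>' would report under this model.

sage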