Looks like the mapping needs to be generated automatically by some external/internal tool, not by the administrator; they could do it by hand, but I don't think a human can properly balance thousands of PGs across thousands of OSDs.

Assuming we have such a tool, can we simplify the "remove" step into "regenerate the mapping"? I.e., when the osdmap changes, all previous mappings are cleared, the CRUSH rule is applied and the pgmap is generated, then the **balance tool** jumps in and generates new mappings, and we end up with a *balanced* pgmap.

> That means either we need to (1) understand the CRUSH rule constraints, or
> (2) encode the (simplified?) set of placement constraints in the Ceph
> OSDMap and auto-generate the CRUSH rule from that.

Would it be simpler to just extend the CrushWrapper to have a validate_pg_map(crush *map, vector<int> pg_map)? It would just hack the "random" logic in crush_choose and see whether CRUSH can generate the same mapping as the wanted mapping.
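To make that a bit more concrete, here is a toy sketch of what I mean by "validate" (everything below is made up for illustration and is not the real CrushWrapper API -- it checks a simplified "replicas in distinct failure domains" constraint directly, whereas the real thing would drive crush_choose itself with the wanted OSDs instead of the pseudorandom picks):

#include <map>
#include <set>
#include <vector>

// osd id -> failure domain id (e.g. host or rack), flattened from the CRUSH tree
using FailureDomainMap = std::map<int, int>;

// Return true if every wanted OSD exists and no two replicas share a failure domain.
bool validate_pg_map(const FailureDomainMap& osd_to_domain,
                     const std::vector<int>& wanted_osds)
{
  std::set<int> domains_used;
  for (int osd : wanted_osds) {
    auto it = osd_to_domain.find(osd);
    if (it == osd_to_domain.end())
      return false;                              // unknown or out OSD
    if (!domains_used.insert(it->second).second)
      return false;                              // two replicas in the same failure domain
  }
  return true;
}

int main()
{
  // osds 0,1 on host 100; 2,3 on host 101; 4,5 on host 102
  FailureDomainMap osd_to_host = {
    {0,100},{1,100},{2,101},{3,101},{4,102},{5,102}
  };
  bool ok  = validate_pg_map(osd_to_host, {0, 2, 4});  // three distinct hosts
  bool bad = validate_pg_map(osd_to_host, {0, 1, 4});  // 0 and 1 share a host
  return (ok && !bad) ? 0 : 1;
}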
Xiaoxi

2017-03-02 6:10 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
>> On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > There's been a longstanding desire to improve the balance of PGs and data across OSDs to better utilize storage and balance workload. We had a few ideas about this in a meeting last week and I wrote up a summary/proposal here:
>> >
>> > http://pad.ceph.com/p/osdmap-explicit-mapping
>> >
>> > The basic idea is to have the ability to explicitly map individual PGs to certain OSDs so that we can move PGs from overfull to underfull devices. The idea is that the mon or mgr would do this based on some heuristics or policy and should result in a better distribution than the current osd weight adjustments we make now with reweight-by-utilization.
>> >
>> > The other key property is that one reason why we need as many PGs as we do now is to get a good balance; if we can remap some of them explicitly, we can get a better balance with fewer. In essence, CRUSH gives an approximate distribution, and then we correct to make it perfect (or close to it).
>> >
>> > The main challenge is less about figuring out when/how to remap PGs to correct balance, but figuring out when to remove those remappings after CRUSH map changes. Some simple greedy strategies are obvious starting points (e.g., to move PGs off OSD X, first adjust or remove existing remap entries targeting OSD X before adding new ones), but there are a few ways we could structure the remap entries themselves so that they more gracefully disappear after a change.
>> >
>> > For example, a remap entry might move a PG from OSD A to B if it maps to A; if the CRUSH topology changes and the PG no longer maps to A, the entry would be removed or ignored. There are a few ways to do this in the pad; I'm sure there are other options.
>> >
>> > I put this on the agenda for CDM tonight. If anyone has any other ideas about this we'd love to hear them!
>>
>> Hi Sage. This would be awesome! Seriously, it would let us run multi-PB clusters closer to being full -- 1% of imbalance on a 100 PB cluster is *expensive*!!
>>
>> I can't join the meeting, but here are my first thoughts on the implementation.
>>
>> First, since this is a new feature, it would be cool if it supported non-trivial topologies from the beginning -- i.e. the trivial topology is when all OSDs should have equal PGs/weight. A non-trivial topology is where only OSDs beneath a user-defined crush bucket are balanced.
>>
>> And something I didn't understand about the remap format options -- when the cluster topology changes, couldn't we just remove all remappings and start again? If you use some consistent method for the remaps, shouldn't the result anyway remain similar after incremental topology changes?
>>
>> In the worst case, remaps will only form maybe 10-20% of PGs. Clearly we don't want to shuffle those for small topology changes, but for larger changes perhaps we can accept those to move.
>
> If there is a small change, then we don't want to toss the mappings... but if there is a large change you do; which ones you toss should probably depend on whether the original mapping they were overriding changed.
>
> The other hard part of this, now that I think about it, is that you have placement constraints encoded into the CRUSH rules (e.g., separate replicas across racks). Whatever is installing new mappings needs to understand those constraints so that the remap entries also respect the policy. That means either we need to (1) understand the CRUSH rule constraints, or (2) encode the (simplified?) set of placement constraints in the Ceph OSDMap and auto-generate the CRUSH rule from that.
>
> For (1), I bet we can make a simple "CRUSH rule interpreter" that, instead of making pseudorandom choices, picks each child based on the minimum utilization (or the original mapping value) in order to generate the remap entries...
>
> sage
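PS: to make sure I understand the remap entry format described above ("move a PG from OSD A to B if it maps to A"), here is roughly how I picture it being applied on top of the raw CRUSH result -- just a sketch, all type and field names below are invented and not the real OSDMap structures:

#include <algorithm>
#include <map>
#include <vector>

// "If CRUSH puts this PG on osd 'from', use osd 'to' instead."
struct remap_entry {
  int from;
  int to;
};

// pg id -> substitutions installed by the balancer (pg id simplified to an int here)
using RemapTable = std::map<int, std::vector<remap_entry>>;

// Apply the remap entries on top of the raw CRUSH mapping.  An entry whose
// 'from' OSD no longer appears in the CRUSH result is simply ignored, so a
// topology change that already moved the PG turns the override into a no-op
// (and a cleanup pass could then drop the stale entry).
std::vector<int> apply_remaps(const RemapTable& table,
                              int pg,
                              std::vector<int> crush_result)
{
  auto it = table.find(pg);
  if (it == table.end())
    return crush_result;
  for (const remap_entry& e : it->second) {
    auto pos = std::find(crush_result.begin(), crush_result.end(), e.from);
    if (pos != crush_result.end())
      *pos = e.to;          // PG still maps to 'from' -> substitute 'to'
  }
  return crush_result;
}

int main()
{
  RemapTable table;
  table[7] = { {2, 5} };                                           // pg 7: move replica from osd.2 to osd.5
  std::vector<int> up = apply_remaps(table, 7, {0, 2, 4});         // -> {0, 5, 4}
  std::vector<int> unchanged = apply_remaps(table, 7, {0, 3, 4});  // osd.2 gone: entry ignored
  return (up[1] == 5 && unchanged[1] == 3) ? 0 : 1;
}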
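And for (1), if I read it right, the "CRUSH rule interpreter" would walk the rule's choose steps but replace the pseudorandom pick with a greedy pick of the least-utilized child. Something like the following at each bucket level -- again only a sketch under that assumption, with made-up types; it skips nested buckets and the other rule steps:

#include <cstddef>
#include <limits>
#include <vector>

// One bucket's children; negative ids would be nested buckets, ids >= 0 are OSDs.
struct Bucket {
  std::vector<int> children;
};

// utilization[osd] and already_chosen[osd] are indexed by OSD id.
// Pick the least-utilized, not-yet-chosen OSD child instead of a pseudorandom one.
// A real version would recurse into nested buckets and honor the rule's
// failure-domain steps; this toy version handles neither.
int choose_least_utilized(const Bucket& b,
                          const std::vector<double>& utilization,
                          const std::vector<bool>& already_chosen)
{
  int best = -1;
  double best_util = std::numeric_limits<double>::max();
  for (int child : b.children) {
    if (child < 0 || static_cast<std::size_t>(child) >= utilization.size())
      continue;                       // nested bucket or unknown OSD: skip in this toy
    if (already_chosen[child])
      continue;                       // keep replicas on distinct OSDs
    if (utilization[child] < best_util) {
      best_util = utilization[child];
      best = child;
    }
  }
  return best;                        // -1 if no eligible child
}

int main()
{
  Bucket host{{0, 1, 2}};
  std::vector<double> util = {0.70, 0.55, 0.80};
  std::vector<bool> chosen = {false, false, false};
  return choose_least_utilized(host, util, chosen) == 1 ? 0 : 1;  // osd.1 is emptiest
}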