On Fri, 3 Mar 2017, Bartłomiej Święcki wrote:
> Hi Sage,
>
> The more I think about explicit-only mapping (so clients and OSDs not
> doing CRUSH computations anymore), the more interesting use cases I can
> see, better balance being only one of them.
>
> Another major advantage I see is that the client would become
> independent of internal CRUSH calculation changes (straw->straw2 was a
> bit problematic on our production, for example), which I believe would
> also simplify keeping backward compatibility, especially for the kernel
> stuff. It also looks like the code would simplify.
>
> Such explicit mapping could also help with all kinds of cluster
> reconfiguration - e.g. growing the cluster by a significant number of
> OSDs could be spread out over time to reduce backfill impact, and the
> same goes for increasing pgp_num. I also believe peering could gain
> here, because the mon/mgr could ensure that only a limited number of
> PGs is blocked in peering at the same time.

Yeah

> About memory overhead - I agree that it is the most important factor to
> consider. I don't know Ceph internals well enough yet to be
> authoritative here, but I believe that during a cluster rebalance the
> amount of data that has to be maintained can already grow way beyond
> what explicit mapping would need. Would it be possible for such an
> explicit map to also store the list of potential OSDs for as long as
> the PG is not fully recovered? That would mean the osdmap history could
> be trimmed much faster, it would be easier to predict mon resource
> requirements, and monitors would sync much faster during recovery.

Keeping the old osdmaps around is a somewhat orthogonal problem, and if
we address that I suspect it'd be a different solution. There is a 'past
intervals' PR in flight that makes the OSDs' tracking of this more
efficient, and I suspect we could come up with something that would let
us trim sooner than we currently do. It'll take some careful planning,
though!

sage

> Regards,
> Bartek
>
> On 02.03.2017 at 19:09, Sage Weil wrote:
> > On Fri, 3 Mar 2017, Tiger Hu wrote:
> > > Hi Sage,
> > > I am very glad to know you raised this issue. In some cases, users
> > > may want to accurately control the PG numbers on every OSD. Is it
> > > possible to add/implement a new policy to support fixed mapping?
> > > This may be useful for performance tuning or testing purposes.
> > > Thanks.
> > In principle this could be used to map every pg explicitly without
> > regard for CRUSH. I think in practice, though, we can have CRUSH map
> > 80-90% of the PGs, and then have explicit mappings for 10-20% in
> > order to achieve the same "perfect" balance of PGs.
> >
> > The result is a smaller OSDMap. It might not matter that much for
> > small clusters, but for large clusters, it is helpful to keep the
> > OSDMap small.
> >
> > OTOH, maybe through all of this we end up in a place where the OSDMap
> > is just an explicit map and the mgr agent that's managing it is using
> > CRUSH to generate it; who knows! That might make for faster mappings
> > on the clients since it's a table lookup instead of a mapping
> > calculation.
> >
> > I'm thinking that if we implement the ability to have the explicit
> > mappings in the OSDMap we open up both possibilities. The hard part
> > is always getting the mapping compatibility into the client, so if we
> > do that right, luminous+ clients will support explicit mapping
> > (overrides) in OSDMap and would be able to work with future versions
> > that go all-in on explicit...
> >
> > sage
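To make the "table lookup instead of a mapping calculation" point above
concrete, here is a minimal sketch of how explicit per-PG overrides could
sit in front of the normal CRUSH calculation. It is illustrative only:
pg_t, crush_map_pg() and MappingTable are made-up stand-ins rather than
Ceph's actual OSDMap interfaces, and it assumes the override table only
ever covers the 10-20% of PGs that need correction.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative stand-in for Ceph's pg_t; just enough to key a hash table.
struct pg_t {
  uint64_t pool;
  uint32_t seed;
  bool operator==(const pg_t &o) const {
    return pool == o.pool && seed == o.seed;
  }
};

struct pg_t_hash {
  std::size_t operator()(const pg_t &p) const {
    return std::hash<uint64_t>()(p.pool) ^
           (std::hash<uint64_t>()(p.seed) << 1);
  }
};

// Stand-in for the real CRUSH calculation (straw2 buckets, rules, etc.);
// here it just spreads PGs deterministically over num_osds.
std::vector<int> crush_map_pg(const pg_t &pg, int num_osds = 12,
                              int size = 3) {
  std::vector<int> out;
  for (int i = 0; i < size; ++i)
    out.push_back(static_cast<int>((pg.pool + pg.seed + i) % num_osds));
  return out;
}

// Explicit overrides consulted before falling back to CRUSH. A hash table
// keeps the lookup O(1) no matter how many overrides accumulate.
class MappingTable {
  std::unordered_map<pg_t, std::vector<int>, pg_t_hash> overrides;

public:
  void set_override(const pg_t &pg, std::vector<int> osds) {
    overrides[pg] = std::move(osds);
  }
  void clear_override(const pg_t &pg) { overrides.erase(pg); }

  std::vector<int> map(const pg_t &pg) const {
    auto it = overrides.find(pg);
    if (it != overrides.end())
      return it->second;      // explicit mapping wins
    return crush_map_pg(pg);  // common case: plain CRUSH
  }
};

In this shape a mgr-side balancer would call set_override() for the PGs it
wants to move and clear_override() once a CRUSH change makes an entry
unnecessary; the common case remains a pure CRUSH calculation and only the
corrected minority of PGs pays for a table entry.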
> > > Tiger
> > > On 3 Mar 2017, at 1:40 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> > > Hi Sage,
> > > The CRUSH algorithm handles the mapping of pgs, and it still will
> > > even with the addition of explicit mappings. I presume finding
> > > which pgs belong to which OSDs will involve additional computation
> > > for each additional explicit mapping.
> > >
> > > What would be the penalty of this additional computation?
> > >
> > > For a small number of explicit mappings such a penalty would be
> > > small; IMO it can get quite expensive with a large number of
> > > explicit mappings. The implementation will need to manage the count
> > > of explicit mappings by reverting some of them as the distribution
> > > changes. An understanding of the additional overhead of the
> > > explicit mappings would have a great influence on the
> > > implementation.
> > >
> > > Yeah, we'll want to be careful with the implementation, e.g., by
> > > using an unordered_map (hash table) for looking up the mappings (an
> > > rbtree won't scale particularly well).
> > >
> > > Note that pg_temp is already a map<pg_t,vector<int>> and can get
> > > big when you're doing a lot of rebalancing, resulting in O(log n)
> > > lookups. If we do something optimized here we should use the same
> > > strategy for pg_temp too.
> > >
> > > sage
> > >
> > > Nitin
> > >
> > > On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
> > > Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
> > > sweil@xxxxxxxxxx> wrote:
> > >
> > > There's been a longstanding desire to improve the balance of PGs
> > > and data across OSDs to better utilize storage and balance
> > > workload. We had a few ideas about this in a meeting last week and
> > > I wrote up a summary/proposal here:
> > >
> > >     http://pad.ceph.com/p/osdmap-explicit-mapping
> > >
> > > The basic idea is to have the ability to explicitly map individual
> > > PGs to certain OSDs so that we can move PGs from overfull to
> > > underfull devices. The idea is that the mon or mgr would do this
> > > based on some heuristics or policy and should result in a better
> > > distribution than the current osd weight adjustments we make now
> > > with reweight-by-utilization.
> > >
> > > The other key property is that one reason why we need as many PGs
> > > as we do now is to get a good balance; if we can remap some of them
> > > explicitly, we can get a better balance with fewer. In essence,
> > > CRUSH gives an approximate distribution, and then we correct it to
> > > make it perfect (or close to it).
> > >
> > > The main challenge is less about figuring out when/how to remap PGs
> > > to correct the balance and more about figuring out when to remove
> > > those remappings after CRUSH map changes. Some simple greedy
> > > strategies are obvious starting points (e.g., to move PGs off OSD
> > > X, first adjust or remove existing remap entries targeting OSD X
> > > before adding new ones), but there are a few ways we could
> > > structure the remap entries themselves so that they more gracefully
> > > disappear after a change.
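As a rough illustration of that greedy step (drain an overfull OSD X by
first dropping remap entries that target it), here is a sketch that
assumes the entries live in a pg -> OSD-set table of the same general
shape as pg_temp's map<pg_t,vector<int>>. The pgid_t alias and the
function name are invented for the example.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-ins: a pg id and a pg -> explicit OSD set table.
using pgid_t = uint64_t;
using remap_table_t = std::map<pgid_t, std::vector<int>>;

// Greedy first pass for draining OSD x: remove up to 'want' remap entries
// whose target set includes x, so those PGs fall back to their plain
// CRUSH placement. Only if this frees too few PGs would new entries
// (mapping PGs away from x) be added on top.
int drop_remaps_targeting(remap_table_t &remaps, int x, int want) {
  int dropped = 0;
  for (auto it = remaps.begin(); it != remaps.end() && dropped < want;) {
    const std::vector<int> &osds = it->second;
    if (std::find(osds.begin(), osds.end(), x) != osds.end()) {
      it = remaps.erase(it);  // entry was pushing data onto x
      ++dropped;
    } else {
      ++it;
    }
  }
  return dropped;
}

Whether dropping an entry actually moves the PG off X depends on where
CRUSH then places it, so a real balancer would re-check the resulting
mapping before counting the removal as progress and only then add new
entries that map PGs away from X.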
> > > For example, a remap entry might move a PG from OSD A to B if it
> > > maps to A; if the CRUSH topology changes and the PG no longer maps
> > > to A, the entry would be removed or ignored. There are a few ways
> > > to do this in the pad; I'm sure there are other options.
> > >
> > > I put this on the agenda for CDM tonight. If anyone has any other
> > > ideas about this we'd love to hear them!
> > >
> > > sage
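To sketch that "A -> B, but only while the PG still maps to A" style of
entry: each entry rewrites one OSD in the raw CRUSH result and silently
becomes a no-op once A drops out of the mapping. The names below are made
up for illustration and are not Ceph's.

#include <vector>

// One conditional remap entry: "if this PG's CRUSH mapping includes
// 'from', place it on 'to' instead."
struct conditional_remap {
  int from;  // OSD the PG maps to today (A)
  int to;    // OSD we would rather it used (B)
};

// Apply the entries to a raw CRUSH result. An entry whose 'from' OSD is
// no longer present is simply skipped, so after a CRUSH topology change
// stale entries become no-ops and can be garbage-collected at leisure.
// Returns true if at least one entry still had an effect.
bool apply_remaps(std::vector<int> &acting,
                  const std::vector<conditional_remap> &entries) {
  bool applied = false;
  for (const conditional_remap &e : entries) {
    for (int &osd : acting) {
      if (osd == e.from) {
        osd = e.to;  // PG still maps to A: redirect that slot to B
        applied = true;
        break;
      }
    }
  }
  return applied;
}

A false return from something like apply_remaps() would be the mon/mgr's
cue that the entries have aged out and can be trimmed from the OSDMap.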