Sage,

Thanks for your reply. Looking forward to explicit mapping.

Tiger

> On Mar 3, 2017, at 2:09 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Fri, 3 Mar 2017, Tiger Hu wrote:
>> Hi Sage,
>>
>> I am very glad to know you raised this issue. In some cases, users may
>> want to accurately control the number of PGs on every OSD. Is it
>> possible to add/implement a new policy to support fixed mapping? This
>> may be useful for performance tuning or testing purposes. Thanks.
>
> In principle this could be used to map every pg explicitly without regard
> for CRUSH. I think in practice, though, we can have CRUSH map 80-90% of
> the PGs, and then have explicit mappings for 10-20% in order to achieve
> the same "perfect" balance of PGs.
>
> The result is a smaller OSDMap. It might not matter that much for small
> clusters, but for large clusters it is helpful to keep the OSDMap small.
>
> OTOH, maybe through all of this we end up in a place where the OSDMap is
> just an explicit map and the mgr agent that's managing it is using CRUSH
> to generate it; who knows! That might make for faster mappings on the
> clients since it's a table lookup instead of a mapping calculation.
>
> I'm thinking that if we implement the ability to have the explicit
> mappings in the OSDMap we open up both possibilities. The hard part is
> always getting the mapping compatibility into the client, so if we do
> that right, luminous+ clients will support explicit mappings (overrides)
> in the OSDMap and would be able to work with future versions that go
> all-in on explicit...
>
> sage
>
>> Tiger
>>
>> On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
>>     Hi Sage,
>>
>>     The CRUSH algorithm handles the mapping of pgs, and it still will
>>     with the addition of explicit mappings. I presume finding which pgs
>>     belong to which OSDs will involve additional computation for each
>>     additional explicit mapping.
>>
>>     What would be the penalty of this additional computation?
>>
>>     For a small number of explicit mappings such a penalty would be
>>     small, but IMO it can get quite expensive with a large number of
>>     explicit mappings. The implementation will need to manage the count
>>     of explicit mappings, reverting some of them as the distribution
>>     changes. Understanding the additional overhead of the explicit
>>     mappings would have a great influence on the implementation.
>>
>> Yeah, we'll want to be careful with the implementation, e.g., by using
>> an unordered_map (hash table) for looking up the mappings (an rbtree
>> won't scale particularly well).
>>
>> Note that pg_temp is already a map<pg_t,vector<int>> and can get big
>> when you're doing a lot of rebalancing, resulting in O(log n) lookups.
>> If we do something optimized here we should use the same strategy for
>> pg_temp too.
>>
>> sage
>>
>>
>>     Nitin
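
As a rough illustration of the lookup structure Sage describes above (a sketch
only, not the actual Ceph code; pg_id, pg_id_hash, crush_map_pg, and map_pg
below are made-up stand-ins for pg_t, the CRUSH calculation, and the OSDMap
mapping path):

// Sketch: an explicit pg -> osd override table consulted before CRUSH.
#include <cstdint>
#include <functional>
#include <iostream>
#include <unordered_map>
#include <vector>

// Minimal stand-in for Ceph's pg_t (pool id + placement seed).
struct pg_id {
  int64_t pool;
  uint32_t seed;
  bool operator==(const pg_id& o) const {
    return pool == o.pool && seed == o.seed;
  }
};

struct pg_id_hash {
  size_t operator()(const pg_id& p) const {
    return std::hash<int64_t>()(p.pool) ^ (std::hash<uint32_t>()(p.seed) << 1);
  }
};

// Explicit overrides: pg -> ordered list of OSD ids. A hash table keeps the
// lookup O(1) on average, unlike the rbtree-backed std::map behind pg_temp.
using explicit_map = std::unordered_map<pg_id, std::vector<int>, pg_id_hash>;

// Placeholder for the normal CRUSH calculation (hypothetical signature).
std::vector<int> crush_map_pg(const pg_id& pg) {
  return {static_cast<int>(pg.seed % 10), static_cast<int>((pg.seed + 3) % 10)};
}

// Check the (small) override table first; fall back to CRUSH otherwise.
std::vector<int> map_pg(const pg_id& pg, const explicit_map& overrides) {
  auto it = overrides.find(pg);
  if (it != overrides.end())
    return it->second;       // table lookup, no mapping calculation
  return crush_map_pg(pg);   // the common case: CRUSH as usual
}

int main() {
  explicit_map overrides;
  overrides[{1, 42}] = {7, 2};   // e.g. the mgr moved pg 1.42 off a full OSD

  for (uint32_t seed : {41u, 42u}) {
    auto osds = map_pg({1, seed}, overrides);
    std::cout << "pg 1." << seed << " -> osd." << osds[0]
              << ", osd." << osds[1] << "\n";
  }
}

The point is just that the override table stays small and cheap to consult on
the client, while CRUSH remains the default path for the vast majority of PGs.
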
>>
>> On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>> Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>> sweil@xxxxxxxxxx> wrote:
>>
>>     There's been a longstanding desire to improve the balance of PGs
>>     and data across OSDs to better utilize storage and balance
>>     workload. We had a few ideas about this in a meeting last week and
>>     I wrote up a summary/proposal here:
>>
>>         http://pad.ceph.com/p/osdmap-explicit-mapping
>>
>>     The basic idea is to have the ability to explicitly map individual
>>     PGs to certain OSDs so that we can move PGs from overfull to
>>     underfull devices. The idea is that the mon or mgr would do this
>>     based on some heuristics or policy, and it should result in a
>>     better distribution than the current osd weight adjustments we make
>>     now with reweight-by-utilization.
>>
>>     The other key property is that one reason why we need as many PGs
>>     as we do now is to get a good balance; if we can remap some of them
>>     explicitly, we can get a better balance with fewer. In essence,
>>     CRUSH gives an approximate distribution, and then we correct it to
>>     make it perfect (or close to it).
>>
>>     The main challenge is less about figuring out when/how to remap PGs
>>     to correct the balance than about figuring out when to remove those
>>     remappings after CRUSH map changes. Some simple greedy strategies
>>     are obvious starting points (e.g., to move PGs off OSD X, first
>>     adjust or remove existing remap entries targeting OSD X before
>>     adding new ones), but there are a few ways we could structure the
>>     remap entries themselves so that they more gracefully disappear
>>     after a change.
>>
>>     For example, a remap entry might move a PG from OSD A to B if it
>>     maps to A; if the CRUSH topology changes and the PG no longer maps
>>     to A, the entry would be removed or ignored. There are a few ways
>>     to do this in the pad; I'm sure there are other options.
>>
>>     I put this on the agenda for CDM tonight. If anyone has any other
>>     ideas about this we'd love to hear them!
>>
>>     sage
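
To make the "gracefully disappear" idea above concrete, here is a similarly
rough sketch (remap_entry and apply_remap are invented names, not Ceph's): an
entry moves a pg from OSD A to OSD B only while CRUSH still places the pg on
A, and is otherwise a no-op that the mon/mgr can prune at leisure.

// Sketch: conditional remap entries that become no-ops after a CRUSH change.
#include <algorithm>
#include <iostream>
#include <vector>

struct remap_entry {
  int from_osd;   // only applies if CRUSH currently places the pg here
  int to_osd;     // where the balancer wants the pg instead
};

// Apply an entry to the CRUSH result; if the pg no longer maps to from_osd,
// the entry is ignored (and can be garbage-collected later).
std::vector<int> apply_remap(std::vector<int> crush_result,
                             const remap_entry& e) {
  auto it = std::find(crush_result.begin(), crush_result.end(), e.from_osd);
  if (it == crush_result.end())
    return crush_result;       // CRUSH changed under us: entry is a no-op
  *it = e.to_osd;
  return crush_result;
}

int main() {
  remap_entry e{3, 7};                  // move the pg from osd.3 to osd.7

  std::vector<int> before{3, 5};        // CRUSH says [3,5]: remap applies
  std::vector<int> after_change{4, 5};  // topology changed: remap is ignored

  for (const auto& crush : {before, after_change}) {
    auto acting = apply_remap(crush, e);
    std::cout << "[" << acting[0] << "," << acting[1] << "]\n";
  }
}

Conditioning the entries on the CRUSH result like this means a topology change
automatically neutralizes stale remaps instead of fighting them.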