On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> Hi Sage,
> The crush algorithm handles the mapping of pgs, and it will continue to
> do so even with the addition of explicit mappings. I presume finding
> which pgs belong to which OSDs will involve additional computation for
> each additional explicit mapping.
>
> What would be the penalty of this additional computation?
>
> For a small number of explicit mappings such a penalty would be small;
> IMO it can get quite expensive with a large number of explicit
> mappings. The implementation will need to manage the count of explicit
> mappings by reverting some of them as the distribution changes. An
> understanding of the additional overhead of the explicit mappings would
> have a great influence on the implementation.

Yeah, we'll want to be careful with the implementation, e.g., by using
an unordered_map (hash table) for looking up the mappings (an rbtree
won't scale particularly well).

Note that pg_temp is already a map<pg_t,vector<int>> and can get big
when you're doing a lot of rebalancing, resulting in O(log n) lookups.
If we do something optimized here we should use the same strategy for
pg_temp too.

sage

> Nitin
>
>
>
> On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
> Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
> sweil@xxxxxxxxxx> wrote:
>
> There's been a longstanding desire to improve the balance of PGs and
> data across OSDs to better utilize storage and balance workload. We
> had a few ideas about this in a meeting last week and I wrote up a
> summary/proposal here:
>
> http://pad.ceph.com/p/osdmap-explicit-mapping
>
> The basic idea is to have the ability to explicitly map individual PGs
> to certain OSDs so that we can move PGs from overfull to underfull
> devices. The idea is that the mon or mgr would do this based on some
> heuristics or policy, and it should result in a better distribution
> than the current osd weight adjustments we make now with
> reweight-by-utilization.
>
> The other key property is that one reason we need as many PGs as we do
> now is to get a good balance; if we can remap some of them explicitly,
> we can get a better balance with fewer. In essence, CRUSH gives an
> approximate distribution, and then we correct it to make it perfect
> (or close to it).
>
> The main challenge is less about figuring out when/how to remap PGs to
> correct the balance, and more about figuring out when to remove those
> remappings after CRUSH map changes. Some simple greedy strategies are
> obvious starting points (e.g., to move PGs off OSD X, first adjust or
> remove existing remap entries targeting OSD X before adding new ones),
> but there are a few ways we could structure the remap entries
> themselves so that they more gracefully disappear after a change.
>
> For example, a remap entry might move a PG from OSD A to B if it maps
> to A; if the CRUSH topology changes and the PG no longer maps to A,
> the entry would be removed or ignored. There are a few ways to do this
> in the pad; I'm sure there are other options.
>
> I put this on the agenda for CDM tonight. If anyone has any other
> ideas about this we'd love to hear them!
>
> sage
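
To make the data-structure discussion above concrete, here is a minimal
C++ sketch. It is not the actual Ceph OSDMap code: the pg_id, remap_entry,
and apply_remap names are illustrative stand-ins. It shows (1) an
unordered_map keyed by PG, giving O(1) average lookups of the explicit
remap entries, and (2) a remap entry that records the "source" OSD so the
entry can be dropped automatically once CRUSH no longer maps the PG to
that OSD -- the "gracefully disappear after a change" behavior described
in the proposal.

// Sketch only: simplified stand-ins, not the real Ceph types.
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>
#include <algorithm>

struct pg_id {                       // stand-in for Ceph's pg_t
  uint64_t pool;
  uint32_t seed;
  bool operator==(const pg_id& o) const {
    return pool == o.pool && seed == o.seed;
  }
};

struct pg_id_hash {
  size_t operator()(const pg_id& p) const {
    return std::hash<uint64_t>()(p.pool) ^
           (std::hash<uint32_t>()(p.seed) << 1);
  }
};

// One explicit remap: "if CRUSH put this PG on 'from', use 'to' instead".
struct remap_entry {
  int from;   // OSD the PG is being moved off of
  int to;     // OSD it should land on instead
};

// Hash table for O(1) average lookup, vs. the O(log n) of a
// map<pg_t,...> (rbtree) like pg_temp.
using remap_table = std::unordered_map<pg_id, remap_entry, pg_id_hash>;

// Apply any remap to the OSD set CRUSH produced for this PG.  If CRUSH
// no longer maps the PG to the entry's 'from' OSD, the entry is stale
// and is erased, so remappings go away on their own after a CRUSH change.
std::vector<int> apply_remap(remap_table& remaps,
                             const pg_id& pg,
                             std::vector<int> crush_up) {
  auto it = remaps.find(pg);
  if (it == remaps.end())
    return crush_up;                   // common case: no remap for this PG
  auto pos = std::find(crush_up.begin(), crush_up.end(), it->second.from);
  if (pos == crush_up.end()) {
    remaps.erase(it);                  // CRUSH already moved it; drop entry
    return crush_up;
  }
  *pos = it->second.to;                // substitute the target OSD
  return crush_up;
}

The pg_temp table mentioned above really is a map<pg_t,vector<int>> with
O(log n) lookups; if a hashed structure is adopted for the explicit
mappings, the same strategy could presumably be applied to pg_temp as
well, as suggested in the reply.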