Re: explicitly mapping pgs in OSDMap

Sage,

Thanks for your reply. Looking forward to explicit mapping.

Tiger
> On Mar 3, 2017, at 2:09 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> 
> On Fri, 3 Mar 2017, Tiger Hu wrote:
>> Hi Sage,
>> I am very glad to know you raised this issue. In some cases, users may want
>> to accurately control the number of PGs on every OSD. Is it possible to
>> add/implement a new policy to support fixed mapping? This may be useful for
>> performance tuning or testing purposes. Thanks.
> 
> In principle this could be used to map every pg explicitly without regard 
> for CRUSH.  I think in practice, though, we can have CRUSH map 80-90% of 
> the PGs, and then have explicit mappings for 10-20% in order to achieve 
> the same "perfect" balance of PGs.
> 
> The result is a smaller OSDMap.  It might not matter that much for small 
> clusters, but for large clusters, it is helpful to keep the OSDMap small.
> 
> OTOH, maybe through all of this we end up in a place where the OSDMap is 
> just an explicit map and the mgr agent that's managing it is using CRUSH 
> to generate it; who knows!  That might make for faster mappings on the 
> clients since it's a table lookup instead of a mapping calculation.
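> 
> To make that concrete, here is a rough sketch of a table-first lookup (the
> types, the CRUSH stand-in, and the container layout below are simplified
> placeholders, not the actual OSDMap code):
> 
>   #include <cstdint>
>   #include <unordered_map>
>   #include <vector>
> 
>   using pg_id_t  = uint64_t;   // placeholder for the real pg_t
>   using osd_id_t = int32_t;
> 
>   // Stand-in for the normal CRUSH calculation.
>   std::vector<osd_id_t> crush_map_pg(pg_id_t pg) {
>     return { static_cast<osd_id_t>(pg % 10),
>              static_cast<osd_id_t>((pg + 3) % 10) };
>   }
> 
>   struct SimpleOSDMap {
>     // Explicit per-PG overrides; only the "corrected" 10-20% live here.
>     std::unordered_map<pg_id_t, std::vector<osd_id_t>> pg_explicit;
> 
>     std::vector<osd_id_t> map_pg(pg_id_t pg) const {
>       auto it = pg_explicit.find(pg);   // cheap hash-table lookup
>       if (it != pg_explicit.end())
>         return it->second;              // explicit override wins
>       return crush_map_pg(pg);          // otherwise fall back to CRUSH
>     }
>   };
> 
>   int main() {
>     SimpleOSDMap m;
>     m.pg_explicit[42] = {3, 7};         // a balance correction for one pg
>     auto a = m.map_pg(42);              // -> {3, 7}, bypassing CRUSH
>     auto b = m.map_pg(43);              // -> whatever CRUSH computes
>     (void)a; (void)b;
>     return 0;
>   }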
> 
> I'm thinking that if we implement the ability to have the explicit 
> mappings in the OSDMap we open up both possibilities.  The hard part is 
> always getting the mapping compatibility into the client, so if we do that 
> right, luminous+ clients will support explicit mapping (overrides) in 
> OSDMap and would be able to work with future versions that go all-in on 
> explicit...
> 
> sage
> 
> 
>> 
>> Tiger
>>      On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> 
>> On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
>>      Hi Sage,
>>      The CRUSH algorithm handles the mapping of pgs, and it still will even
>>      with the addition of explicit mappings. I presume that finding which
>>      pgs belong to which OSDs will involve additional computation for each
>>      additional explicit mapping.
>> 
>>      What would be the penalty of this additional computation?
>> 
>>      For a small number of explicit mappings such a penalty would be small,
>>      but IMO it can get quite expensive with a large number of explicit
>>      mappings. The implementation will need to manage the count of explicit
>>      mappings by reverting some of them as the distribution changes. An
>>      understanding of the additional overhead of the explicit mappings
>>      would have a great influence on the implementation.
>> 
>> 
>> Yeah, we'll want to be careful with the implementation, e.g., by using an
>> unordered_map (hash table) for looking up the mappings (an rbtree won't
>> scale particularly well).
>> 
>> Note that pg_temp is already a map<pg_t,vector<int>> and can get big when
>> you're doing a lot of rebalancing, resulting in O(log n) lookups.  If we
>> do something optimized here we should use the same strategy for pg_temp
>> too.
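>> 
>> As a rough illustration of the data-structure point (the pg key below is a
>> simplified stand-in, not the real pg_t), a hash table gives O(1) average
>> lookups where a tree-based map<pg_t,vector<int>> like pg_temp gives O(log n):
>> 
>>   #include <cstddef>
>>   #include <cstdint>
>>   #include <functional>
>>   #include <map>
>>   #include <unordered_map>
>>   #include <vector>
>> 
>>   // Simplified pg key: pool id plus placement seed.
>>   struct pg_key {
>>     int64_t  pool;
>>     uint32_t seed;
>>     bool operator==(const pg_key& o) const {
>>       return pool == o.pool && seed == o.seed;
>>     }
>>     bool operator<(const pg_key& o) const {      // required by std::map
>>       return pool < o.pool || (pool == o.pool && seed < o.seed);
>>     }
>>   };
>> 
>>   // Hash required by std::unordered_map.
>>   struct pg_key_hash {
>>     std::size_t operator()(const pg_key& k) const {
>>       return std::hash<uint64_t>{}(
>>           (static_cast<uint64_t>(k.pool) << 32) ^ k.seed);
>>     }
>>   };
>> 
>>   // O(log n) lookups, like today's pg_temp.
>>   std::map<pg_key, std::vector<int>> pg_temp;
>> 
>>   // O(1) average lookups for the explicit mappings (and ideally pg_temp too).
>>   std::unordered_map<pg_key, std::vector<int>, pg_key_hash> pg_explicit;
>> 
>>   int main() {
>>     pg_key k{1, 0x16};
>>     pg_temp[k]     = {0, 3};    // tree insert/lookup
>>     pg_explicit[k] = {5, 7};    // hash insert/lookup
>>     return 0;
>>   }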
>> 
>> sage
>> 
>> 
>> 
>>      Nitin
>> 
>> 
>> 
>> 
>>      On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on
>>      behalf of Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on
>>      behalf of sweil@xxxxxxxxxx> wrote:
>> 
>>         There's been a longstanding desire to improve the balance of PGs
>>         and data across OSDs to better utilize storage and balance workload.
>>         We had a few ideas about this in a meeting last week and I wrote up
>>         a summary/proposal here:
>> 
>>          http://pad.ceph.com/p/osdmap-explicit-mapping
>> 
>>         The basic idea is to have the ability to explicitly map individual
>>         PGs to certain OSDs so that we can move PGs from overfull to
>>         underfull devices.  The idea is that the mon or mgr would do this
>>         based on some heuristics or policy and should result in a better
>>         distribution than the current osd weight adjustments we make now
>>         with reweight-by-utilization.
>> 
>>         The other key property is that one reason why we need as many PGs
>>         as we do now is to get a good balance; if we can remap some of them
>>         explicitly, we can get a better balance with fewer.  In essence,
>>         CRUSH gives an approximate distribution, and then we correct to
>>         make it perfect (or close to it).
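>> 
>>         As a sketch of that "correct the approximation" step, a greedy pass
>>         might record explicit entries until the PG counts even out
>>         (everything below is illustrative: the utilization model, the choice
>>         of source/target OSD, and the remap table are placeholders, not a
>>         proposed implementation):
>> 
>>           #include <algorithm>
>>           #include <cstddef>
>>           #include <cstdint>
>>           #include <unordered_map>
>>           #include <vector>
>> 
>>           using pg_id_t  = uint64_t;
>>           using osd_id_t = int32_t;
>> 
>>           // Greedily move one PG at a time from the fullest OSD to the
>>           // emptiest one, recording an explicit remap entry for each move,
>>           // until the spread is within 'slack'.
>>           std::unordered_map<pg_id_t, osd_id_t>
>>           balance(std::unordered_map<osd_id_t, std::vector<pg_id_t>> by_osd,
>>                   std::size_t slack = 1) {
>>             std::unordered_map<pg_id_t, osd_id_t> remap;  // pg -> new osd
>>             if (by_osd.size() < 2)
>>               return remap;
>>             for (;;) {
>>               auto cmp = [](const auto& a, const auto& b) {
>>                 return a.second.size() < b.second.size();
>>               };
>>               auto full  = std::max_element(by_osd.begin(), by_osd.end(), cmp);
>>               auto empty = std::min_element(by_osd.begin(), by_osd.end(), cmp);
>>               if (full->second.size() <= empty->second.size() + slack)
>>                 break;                           // balanced enough
>>               pg_id_t pg = full->second.back();  // pick any pg on the full osd
>>               full->second.pop_back();
>>               empty->second.push_back(pg);
>>               remap[pg] = empty->first;          // record explicit mapping
>>             }
>>             return remap;
>>           }
>> 
>>           // e.g. balance({{0, {1,2,3,4}}, {1, {5}}}) remaps pg 4 to osd 1.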
>> 
>>         The main challenge is less about figuring out when/how to remap PGs
>>         to correct balance than about figuring out when to remove those
>>         remappings after CRUSH map changes.  Some simple greedy strategies
>>         are obvious starting points (e.g., to move PGs off OSD X, first
>>         adjust or remove existing remap entries targeting OSD X before
>>         adding new ones), but there are a few ways we could structure the
>>         remap entries themselves so that they more gracefully disappear
>>         after a change.
>> 
>>         For example, a remap entry might move a PG from OSD A to B if it
>>         maps to A; if the CRUSH topology changes and the PG no longer maps
>>         to A, the entry would be removed or ignored.  There are a few ways
>>         to do this in the pad; I'm sure there are other options.
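>> 
>>         One way to express that kind of self-cancelling entry (again only a
>>         sketch with placeholder types, not a proposed on-disk format):
>> 
>>           #include <algorithm>
>>           #include <cstdint>
>>           #include <vector>
>> 
>>           using pg_id_t  = uint64_t;
>>           using osd_id_t = int32_t;
>> 
>>           // "If CRUSH puts this pg on 'from', use 'to' instead."
>>           struct pg_remap_entry {
>>             pg_id_t  pg;
>>             osd_id_t from;
>>             osd_id_t to;
>>           };
>> 
>>           // Apply the entry to the CRUSH result.  Returns false when the
>>           // CRUSH topology no longer maps the pg to 'from', i.e. the entry
>>           // is stale and can simply be ignored or pruned.
>>           bool apply_remap(const pg_remap_entry& e,
>>                            std::vector<osd_id_t>& up) {
>>             auto it = std::find(up.begin(), up.end(), e.from);
>>             if (it == up.end())
>>               return false;    // stale: pg no longer maps to 'from'
>>             *it = e.to;        // still relevant: substitute 'to' for 'from'
>>             return true;
>>           }
>> 
>>         A periodic cleanup pass could then drop any entry for which
>>         apply_remap() comes back false after a CRUSH map change.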
>> 
>>         I put this on the agenda for CDM tonight.  If anyone has any other
>>         ideas about this we'd love to hear them!
>> 
>>         sage
