Re: explicitly mapping pgs in OSDMap

On Fri, 3 Mar 2017, Bartłomiej Święcki wrote:
> Hi Sage,
> 
> The more I think about explicit-only mapping (so clients and OSDs
> no longer doing CRUSH computations at all), the more interesting
> use cases I can see, better balance being only one of them.
> 
> Another major advantage I see is that the client would become independent
> of internal CRUSH calculation changes (straw->straw2 was a bit
> problematic on our production cluster, for example), which I believe would
> also simplify keeping backward compatibility, especially for the
> kernel stuff. It also looks like the code would simplify.
> 
> Such explicit mapping could also help with all kinds of cluster
> reconfiguration work - e.g. growing the cluster by a significant number of
> OSDs could be spread out over time to reduce backfill impact, and the same
> goes for increasing pgp_num. I also believe peering could gain here too,
> because the mon/mgr could ensure that only a limited number of PGs is
> blocked peering at the same time.

Yeah
 
> About memory overhead - I agree that it is the most important factor to
> consider. I don't know Ceph internals well enough yet to be authoritative
> here, but I believe that during a cluster rebalance the amount of data
> that has to be maintained can already grow well beyond what explicit
> mapping would need. Is it possible that such an explicit map would also
> store a list of potential OSDs as long as the PG is not fully recovered?
> That would mean the osdmap history could be trimmed much faster, it
> would be easier to predict mon resource requirements, and monitors would
> sync much faster during recovery.

Keeping the old osdmaps around is a somewhat orthogonal problem, and if we 
address that I suspect it'd be a different solution.  There is a 'past 
intervals' PR in flight that makes the OSDs' tracking of this more 
efficient, and I suspect we could come up with something that would let us 
trim sooner than we currently do.  It'll take some careful planning, 
though!

sage


> 
> Regards,
> Bartek
> 
> 
> On 02.03.2017 at 19:09, Sage Weil wrote:
> > On Fri, 3 Mar 2017, Tiger Hu wrote:
> > > Hi Sage,
> > > I am very glad to know you raised this issue. In some cases, users
> > > may want to accurately control the PG count on every OSD. Is it
> > > possible to add/implement a new policy to support fixed mapping?
> > > This may be useful for performance tuning or test purposes. Thanks.
> > In principle this could be used to map every pg explicitly without regard
> > for CRUSH.  I think in practice, though, we can have CRUSH map 80-90% of
> > the PGs, and then have explicit mappings for 10-20% in order to achieve
> > the same "perfect" balance of PGs.
> > 
> > The result is a smaller OSDMap.  It might not matter that much for small
> > clusters, but for large clusters, it is helpful to keep the OSDMap small.
> > 
> > OTOH, maybe through all of this we end up in a place where the OSDMap is
> > just an explicit map and the mgr agent that's managing it is using CRUSH
> > to generate it; who knows!  That might make for faster mappings on the
> > clients since it's a table lookup instead of a mapping calculation.
> > 
> > I'm thinking that if we implement the ability to have the explicit
> > mappings in the OSDMap we open up both possibilities.  The hard part is
> > always getting the mapping compatibility into the client, so if we do that
> > right, luminous+ clients will support explicit mapping (overrides) in
> > OSDMap and would be able to work with future versions that go all-in on
> > explicit...
> > 
> > sage
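
(To make the "table lookup instead of a mapping calculation" point above
concrete, here is a minimal sketch of what that lookup path could look like
with explicit entries layered over CRUSH. This is not actual Ceph code;
pg_t, pg_hash, crush_map_pg and explicit_pg_map are placeholder names.)

    // Sketch only, not Ceph code: explicit overrides are consulted first,
    // with CRUSH as the fallback.  All names here are placeholders.
    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    struct pg_t {                  // placeholder PG id (pool + hash seed)
      uint64_t pool;
      uint32_t seed;
      bool operator==(const pg_t &o) const {
        return pool == o.pool && seed == o.seed;
      }
    };

    struct pg_hash {               // hash so pg_t can key an unordered_map
      std::size_t operator()(const pg_t &pg) const {
        return std::hash<uint64_t>()(pg.pool) ^
               (std::hash<uint64_t>()(pg.seed) << 1);
      }
    };

    // Stub standing in for the real CRUSH calculation.
    std::vector<int> crush_map_pg(const pg_t &pg) {
      return {static_cast<int>(pg.seed % 10),
              static_cast<int>((pg.seed + 1) % 10)};
    }

    // Explicit overrides for the 10-20% of PGs that need correcting.
    using explicit_pg_map = std::unordered_map<pg_t, std::vector<int>, pg_hash>;

    std::vector<int> map_pg(const explicit_pg_map &overrides, const pg_t &pg) {
      auto it = overrides.find(pg);   // O(1) average table lookup
      if (it != overrides.end())
        return it->second;            // explicit mapping wins
      return crush_map_pg(pg);        // otherwise fall back to CRUSH
    }

(If the OSDMap eventually went all-in on explicit mappings, map_pg would
simply stop falling back; the compatibility point is that luminous+ clients
would already understand the override table either way.)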
> > 
> > 
> > > Tiger
> > >        On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > 
> > > On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> > >    Hi Sage,
> > >    The CRUSH algorithm handles the mapping of PGs, and it will
> > >    continue to do so even with the addition of explicit mappings.
> > >    I presume that finding which PGs belong to which OSDs will
> > >    involve additional computation for each additional explicit
> > >    mapping.
> > > 
> > >    What would be the penalty of this additional computation?
> > > 
> > >    For a small number of explicit mappings such a penalty would be
> > >    small, but IMO it can get quite expensive with a large number of
> > >    explicit mappings.  The implementation will need to manage the
> > >    count of explicit mappings, by reverting some of them as the
> > >    distribution changes.  Understanding the additional overhead of
> > >    the explicit mappings would have a great influence on the
> > >    implementation.
> > > 
> > > 
> > > Yeah, we'll want to be careful with the implementation, e.g., by
> > > using an unordered_map (hash table) for looking up the mappings (an
> > > rbtree won't scale particularly well).
> > > 
> > > Note that pg_temp is already a map<pg_t,vector<int>> and can get big
> > > when you're doing a lot of rebalancing, resulting in O(log n)
> > > lookups.  If we do something optimized here we should use the same
> > > strategy for pg_temp too.
> > > 
> > > sage
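
(For a rough feel for the difference, a standalone toy benchmark along
these lines, with synthetic integer pg ids rather than Ceph code, compares
the two lookup paths once the table holds on the order of a million
entries:)

    // Toy comparison of lookup cost: std::map (rbtree, like today's
    // pg_temp) vs std::unordered_map (hash table).  Synthetic keys only,
    // not Ceph code.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <unordered_map>
    #include <vector>

    template <typename Map>
    long long time_lookups(const Map &m, uint64_t n) {
      auto t0 = std::chrono::steady_clock::now();
      uint64_t sum = 0;
      for (uint64_t pg = 0; pg < n; ++pg)
        sum += m.find(pg)->second[0];   // every key is present
      auto t1 = std::chrono::steady_clock::now();
      std::printf("checksum %llu\n", (unsigned long long)sum);
      return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0)
          .count();
    }

    int main() {
      const uint64_t n = 1 << 20;       // ~1M synthetic pg ids
      std::map<uint64_t, std::vector<int>> tree;
      std::unordered_map<uint64_t, std::vector<int>> hash;
      for (uint64_t pg = 0; pg < n; ++pg) {
        tree[pg] = {1, 2, 3};
        hash[pg] = {1, 2, 3};
      }
      std::printf("std::map:           %lld ms\n", time_lookups(tree, n));
      std::printf("std::unordered_map: %lld ms\n", time_lookups(hash, n));
      return 0;
    }

(In real code the key would be pg_t rather than a plain integer, so the
hash table also needs a reasonable hash for pg_t; the same container choice
could then back both pg_temp and the explicit mappings, as suggested above.)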
> > > 
> > > 
> > > 
> > >        Nitin
> > > 
> > > 
> > > 
> > > 
> > >        On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on
> > >        behalf of Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on
> > >        behalf of sweil@xxxxxxxxxx> wrote:
> > > 
> > >    There's been a longstanding desire to improve the balance of PGs
> > >    and data across OSDs to better utilize storage and balance
> > >    workload.  We had a few ideas about this in a meeting last week
> > >    and I wrote up a summary/proposal here:
> > > 
> > >        http://pad.ceph.com/p/osdmap-explicit-mapping
> > > 
> > >    The basic idea is to have the ability to explicitly map individual
> > >    PGs to certain OSDs so that we can move PGs from overfull to
> > >    underfull devices.  The idea is that the mon or mgr would do this
> > >    based on some heuristics or policy and should result in a better
> > >    distribution than the current osd weight adjustments we make now
> > >    with reweight-by-utilization.
> > > 
> > >    The other key property is that one reason why we need as many PGs
> > >    as we do now is to get a good balance; if we can remap some of
> > >    them explicitly, we can get a better balance with fewer.  In
> > >    essence, CRUSH gives an approximate distribution, and then we
> > >    correct it to make it perfect (or close to it).
> > > 
> > >    The main challenge is less about figuring out when/how to remap
> > >    PGs to correct balance, and more about figuring out when to remove
> > >    those remappings after CRUSH map changes.  Some simple greedy
> > >    strategies are obvious starting points (e.g., to move PGs off OSD
> > >    X, first adjust or remove existing remap entries targeting OSD X
> > >    before adding new ones), but there are a few ways we could
> > >    structure the remap entries themselves so that they more
> > >    gracefully disappear after a change.
> > > 
> > >    For example, a remap entry might move a PG from OSD A to B if it
> > >    maps to A; if the CRUSH topology changes and the PG no longer maps
> > >    to A, the entry would be removed or ignored.  There are a few ways
> > >    to do this in the pad; I'm sure there are other options.
> > > 
> > >    I put this on the agenda for CDM tonight.  If anyone has any other
> > >    ideas about this we'd love to hear them!
> > > 
> > >    sage
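
(One way to read the "removed or ignored" behaviour above is as a
conditional override: the entry only applies while CRUSH still places the
PG on the source OSD. A small illustrative sketch, not the actual scheme
from the pad, just one possible shape for it:)

    // Illustrative only: a remap entry that moves a PG from OSD 'from' to
    // OSD 'to', but only while the raw CRUSH result still contains 'from'.
    #include <algorithm>
    #include <vector>

    struct remap_entry {
      int from;   // overfull OSD the PG is being moved off
      int to;     // underfull OSD it should land on instead
    };

    // Apply one entry to the raw CRUSH mapping.  Returns false if the
    // entry is stale (CRUSH no longer places the PG on 'from'), so it can
    // simply be ignored.
    bool apply_remap(std::vector<int> &up, const remap_entry &e) {
      auto it = std::find(up.begin(), up.end(), e.from);
      if (it == up.end())
        return false;       // topology changed; entry no longer applies
      *it = e.to;           // override just this one position
      return true;
    }

    // Garbage-collect entries that no longer apply, so the remappings
    // "gracefully disappear" after a CRUSH map change.
    void prune_remaps(const std::vector<int> &raw_up,
                      std::vector<remap_entry> &remaps) {
      remaps.erase(std::remove_if(remaps.begin(), remaps.end(),
                                  [&](const remap_entry &e) {
                                    return std::find(raw_up.begin(),
                                                     raw_up.end(),
                                                     e.from) == raw_up.end();
                                  }),
                   remaps.end());
    }

(Whether stale entries get pruned eagerly by the mon/mgr or simply ignored
at mapping time is exactly the kind of policy question the thread and the
pad leave open.)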
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> 
> 
> 
