RE: explicitly mapping pgs in OSDMap

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Wednesday, March 01, 2017 2:11 PM
> To: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: explicitly mapping pgs in OSDMap
> 
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
> > On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > There's been a longstanding desire to improve the balance of PGs and
> > > data across OSDs to better utilize storage and balance workload.  We
> > > had a few ideas about this in a meeting last week and I wrote up a
> > > summary/proposal
> > > here:
> > >
> > >         http://pad.ceph.com/p/osdmap-explicit-mapping
> > >
> > > The basic idea is to have the ability to explicitly map individual
> > > PGs to certain OSDs so that we can move PGs from overfull to
> > > underfull devices.  The idea is that the mon or mgr would do this
> > > based on some heuristics or policy, which should result in a better
> > > distribution than the current osd weight adjustments we make with
> > > reweight-by-utilization.
> > >
> > > The other key property is that one reason why we need as many PGs as
> > > we do now is to get a good balance; if we can remap some of them
> > > explicitly, we can get a better balance with fewer.  In essence,
> > > CRUSH gives an approximate distribution, and then we correct it to make
> > > it perfect (or close to it).
> > >
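
To make the layering concrete, here is a rough Python sketch of the idea as I read it: CRUSH produces the approximate placement, and a small table of explicit per-PG overrides corrects it. All names and structures below are my own illustration, not the actual OSDMap encoding.

    import hashlib

    def crush_map_pg(pg, size=3, num_osds=12):
        # Stand-in for the real CRUSH calculation: a deterministic
        # pseudo-random pg -> ordered OSD list.
        osds, i = [], 0
        while len(osds) < size:
            h = int(hashlib.md5(f"{pg}:{i}".encode()).hexdigest(), 16)
            osd = h % num_osds
            if osd not in osds:
                osds.append(osd)
            i += 1
        return osds

    # pg -> explicit OSD list chosen by the mon/mgr balancer (made-up example)
    remap_table = {"1.2f": [4, 11, 2]}

    def map_pg(pg):
        # CRUSH gives the approximate distribution; the remap table corrects
        # the handful of PGs the balancer decided to move.
        if pg in remap_table:
            return remap_table[pg]
        return crush_map_pg(pg)
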
> > > The main challenge is less about figuring out when/how to remap PGs
> > > to correct balance than about figuring out when to remove those remappings
> > > after CRUSH map changes.  Some simple greedy strategies are obvious
> > > starting points (e.g., to move PGs off OSD X, first adjust or remove
> > > existing remap entries targeting OSD X before adding new ones), but
> > > there are a few ways we could structure the remap entries themselves
> > > so that they more gracefully disappear after a change.
> > >
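
The greedy starting point quoted above might look something like the sketch below; the (from_osd, to_osd) entry format and the pick_underfull() callback are placeholders of mine, not anything decided in the pad.

    def move_pgs_off(x, pgs_currently_on_x, remap_entries, pick_underfull):
        # remap_entries: pg -> (from_osd, to_osd); pick_underfull(pg) is a
        # hypothetical balancer callback that picks a less-full target OSD.
        for pg in pgs_currently_on_x:
            entry = remap_entries.get(pg)
            if entry and entry[1] == x:
                # This PG only lands on X because an earlier remap sent it
                # there; undoing that entry is the cheapest correction.
                del remap_entries[pg]
            else:
                # Otherwise add a new entry redirecting the PG away from X.
                remap_entries[pg] = (x, pick_underfull(pg))
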
> > > For example, a remap entry might move a PG from OSD A to B if it
> > > maps to A; if the CRUSH topology changes and the PG no longer maps
> > > to A, the entry would be removed or ignored.  There are a few ways
> > > to do this in the pad; I'm sure there are other options.
> > >
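
That conditional form seems the easiest to reason about. Applied at mapping time it could look roughly like this, reusing the illustrative crush_map_pg() and (from, to) entries from my sketches above:

    def apply_pairwise_remap(pg, remap_entries):
        # Entry (a, b) means "if CRUSH maps this PG to OSD a, use b instead".
        # If the topology changes and a drops out of the raw mapping, the
        # entry simply has no effect and can be garbage-collected later.
        raw = crush_map_pg(pg)
        entry = remap_entries.get(pg)
        if entry and entry[0] in raw:
            a, b = entry
            return [b if osd == a else osd for osd in raw]
        return raw
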
> > > I put this on the agenda for CDM tonight.  If anyone has any other
> > > ideas about this we'd love to hear them!
> > >
> >
> > Hi Sage. This would be awesome! Seriously, it would let us run
> > multi-PB clusters closer to being full -- 1% of imbalance on a 100 PB
> > cluster is *expensive*!!

Whoa there. This will reduce the variance in fullness across OSDs, which is a really good thing.

However, you will still see a drop-off in individual OSD performance as each one fills up. Nothing here addresses that.

> >
> > I can't join the meeting, but here are my first thoughts on the
> > implementation.
> >
> > First, since this is a new feature, it would be cool if it supported
> > non-trivial topologies from the beginning -- i.e. the trivial topology
> > is when all OSDs should have equal PGs/weight. A non-trivial topology
> > is where only OSDs beneath a user-defined crush bucket are balanced.
> >
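
For what it's worth, scoping the balancer to a subtree doesn't look much harder to express: the targets are just computed over the OSDs beneath the chosen bucket instead of the whole tree. A toy illustration (names are mine):

    def target_pgs_per_osd(osd_weights, pg_count_in_subtree):
        # osd_weights: {osd_id: crush_weight} restricted to the OSDs under
        # the user-chosen bucket; each OSD's target is its share of the
        # subtree's total weight.
        total = sum(osd_weights.values())
        return {osd: pg_count_in_subtree * w / total
                for osd, w in osd_weights.items()}
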
> > And something I didn't understand about the remap format options --
> > when the cluster topology changes, couldn't we just remove all
> > remappings and start again? If you use some consistent method for the
> > remaps, shouldn't the result remain similar anyway after incremental
> > topology changes?
> >
> > In the worst case, remaps will only cover maybe 10-20% of PGs. Clearly
> > we don't want to shuffle those for small topology changes, but for
> > larger changes perhaps we can accept moving them.
> 
> If there is a small change, then we don't want to toss the mappings, but if
> there is a large change we do; which ones we toss should probably depend
> on whether the original mapping they were overriding changed.
> 
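
With the pairwise entries sketched earlier, that check could be as simple as recomputing the raw mapping and dropping any entry whose overridden OSD no longer appears in it:

    def prune_stale_remaps(remap_entries):
        # After a CRUSH change, keep an entry only if the mapping it was
        # overriding is still in effect, i.e. the 'from' OSD still shows
        # up in the raw CRUSH result for that PG.
        for pg, (a, _b) in list(remap_entries.items()):
            if a not in crush_map_pg(pg):
                del remap_entries[pg]
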
> The other hard part of this, now that I think about it, is that you have
> placement constraints encoded into the CRUSH rules (e.g., separate replicas
> across racks).  Whatever is installing new mappings needs to understand
> those constraints so that the remap entries also respect the policy.  That
> means either we need to (1) understand the CRUSH rule constraints, or (2)
> encode the (simplified?) set of placement constraints in the Ceph OSDMap
> and auto-generate the CRUSH rule from that.
> 
> For (1), I bet we can make a simple "CRUSH rule interpreter" that, instead of
> making pseudorandom choices, picks each child based on the minimum
> utilization (or the original mapping value) in order to generate the remap
> entries...
> 
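
For what it's worth, here is a very rough sketch of what that interpreter might look like over a simplified rack -> host -> osd hierarchy; the tree layout and the utilization() callback are my own placeholders, and it assumes there are at least size racks.

    def least_utilized_placement(tree, size, utilization):
        # tree: {rack: {host: [osd_id, ...], ...}, ...}; utilization(node)
        # returns the fullness of a rack, host, or OSD.  Instead of CRUSH's
        # pseudorandom descent, pick the emptiest child at each step while
        # still taking each replica from a different rack, so the
        # "separate replicas across racks" constraint is preserved.
        chosen, used_racks = [], set()
        while len(chosen) < size:
            rack = min((r for r in tree if r not in used_racks),
                       key=utilization)
            used_racks.add(rack)
            host = min(tree[rack], key=utilization)
            osd = min(tree[rack][host], key=utilization)
            chosen.append(osd)
        return chosen
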
> sage