Re: Disabling CRUSH for erasure code and doing custom placement

I wonder if what *would* make some sense here would be to add an exception 
map to OSDMap similar to pg_temp, but called pg_force (or similar), that is 
a persistent, forced mapping of a pg to a set of OSDs.  This would, in 
principle, let you force a mapping for every pg and have no (or an empty) 
CRUSH map.

The main thing I would do differently there from pg_temp would be to have 
a priority/type field for each mapping so that tools can distinguish 
between entries set by automated scripts, by an admin, or by whatever 
else.  Right now the single-level pg_temp remapping doesn't let you do 
that (it is effectively always "owned" by the OSDs' peering process); 
there is a similar subtlety with the OSDMap weights (which may be set by 
an admin or by reweight-by-utilization, for example).
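
Very roughly, the sort of thing I'm picturing (a sketch only -- the names 
and types below are placeholders, nothing like this exists in the tree 
today):

    // hypothetical sketch: a persistent forced mapping kept in OSDMap
    // next to pg_temp, with an origin tag so tools can tell who owns
    // each entry and avoid clobbering someone else's mappings
    #include <cstdint>
    #include <map>
    #include <vector>

    enum class pg_force_origin : uint8_t {
      ADMIN = 0,   // set by hand by an operator
      TOOL  = 1,   // set by an automated script/balancer
    };

    struct pg_force_entry {
      std::vector<int32_t> osds;   // forced acting set for this pg
      pg_force_origin origin;      // who set the mapping
    };

    // would live in OSDMap, keyed by pg id (pg_t in the real code):
    // std::map<pg_t, pg_force_entry> pg_force;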

What does everyone think?

sage


On Tue, 24 Jun 2014, Gregory Farnum wrote:

> On Tue, Jun 24, 2014 at 9:12 AM, Shayan Saeed <shayansaeed93@xxxxxxxxx> wrote:
> > I assumed that creating a large number of pools might not be scalable.
> > If there is no overhead in creating as many pools as I want within an
> > OSD, I would probably choose this option.
> 
> There is an overhead per-PG, and pools create PGs, but OSDs expect to
> hold hundreds and can generally handle several thousand.
> 
> > I just want to specify that
> > systematic chunks should go on 'a' racks while the others are
> > distributed among 'b' racks. The only problem is that I want to do this
> > for every incoming file (the k and m for erasure-coded files can vary
> > too), and while there are only around 10 racks, the number of
> > combinations could grow quite large, which would make the CRUSH map
> > file huge.
> 
> Well, you specify the EC rules to use on a per-pool basis. You
> *really* aren't going to be able to change this so that a pool
> contains objects of different encoding schemes; the encoding is
> inherent in how many OSDs are members of the PG, etc.
> However, it's quite simple to specify a group of OSDs which are used
> for the data chunks, and a separate group of OSDs used for the parity
> chunks. Just set up separate CRUSH map roots for each, and then do
> multiple take...emit steps within the rule.
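> 
> A minimal sketch of such a rule (the root names data_racks and
> parity_racks are placeholders for roots you'd define yourself, and the
> chunk counts assume a k=4, m=2 profile):
> 
>     rule ec_split {
>         ruleset 1
>         type erasure
>         min_size 3
>         max_size 20
>         step take data_racks
>         step chooseleaf indep 4 type rack
>         step emit
>         step take parity_racks
>         step chooseleaf indep 2 type rack
>         step emit
>     }
> 
> The first take...emit places the 4 data chunks on OSDs in racks under
> the data_racks root; the second places the 2 coding chunks under
> parity_racks.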
> 
> 
> > Would this affect my performance if the number of pools and
> > CRUSH rules grows abnormally large?
> >
> > I might go for this option if there is no prohibitive trade-off and/or
> > changing the source code for this proves really challenging.
> 
> The source changes you're talking about will prove really challenging. ;)
> -Greg
> 
> >
> > Regards,
> > Shayan Saeed
> >
> >
> > On Tue, Jun 24, 2014 at 11:37 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> >> On Tue, Jun 24, 2014 at 8:29 AM, Shayan Saeed <shayansaeed93@xxxxxxxxx> wrote:
> >>> Hi,
> >>>
> >>> The CRUSH placement algorithm works really nicely with replication.
> >>> However, with erasure code, my cluster has some issues which require
> >>> making changes that I cannot specify with CRUSH maps.
> >>> Sometimes, depending on the type of data, I would like to place it
> >>> on different OSDs but in the same pool.
> >>
> >> Why do you want to keep the data in the same pool?
> >>
> >>>
> >>> I realize that to disable the CRUSH placement algorithm and replace
> >>> it with my own custom algorithm, such as a random placement algorithm
> >>> or any other, I have to make changes to the source code. I want to ask
> >>> if there is an easy way to do this without going into every code file,
> >>> looking for where the mapping from objects to PGs is done, and changing
> >>> that. Is there some configuration option which disables CRUSH and
> >>> points to my own placement algorithm for doing custom placement?
> >>
> >> What you're asking for really doesn't sound feasible, but the thing
> >> that comes closest would probably be resurrecting the "pg preferred"
> >> mechanisms in CRUSH and the Ceph codebase. You'll have to go back
> >> through the git history to find it, but once upon a time we supported
> >> a mechanism that let you specify a specific OSD you wanted a
> >> particular object to live on, and then it would place the remaining
> >> replicas using CRUSH.
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>
> >>>
> >>> Let me know the neatest way to go about it. I appreciate any help
> >>> I can get.
> >>>
> >>> Regards,
> >>> Shayan Saeed
> >>> Research Assistant, Systems Research Lab
> >>> University of Illinois Urbana-Champaign