On 19 April 2016 at 05:07, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Mon, Apr 18, 2016 at 11:57 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Mon, 18 Apr 2016, Adam C. Emerson wrote:
>>> > I think that in those cases, we let them use a wacky object -> PG
>>> > mapping, and then have a linear/flat PG -> device mapping so they
>>> > can still get some uniformity.
>>> >
>>> > This isn't completely general, but I'd want to see an example of
>>> > something it can't express. Maybe those entanglement erasure codes
>>> > that Veronica was talking about at FAST?
>>> >
>>> > Or maybe the key step is to not assume the PG is a hash range, but
>>> > instead consider it an arbitrary bucket in the pool.
>>>
>>> Would the idea basically be, in that case, that we go from OID =>
>>> WHATEVER => DEVICE LIST without putting too many constraints on what
>>> 'WHATEVER' is?
>>
>> Yeah, probably with the additional/implicit restriction that the OID ->
>> WHATEVER mapping is fixed. That way, when a policy or topology change
>> happens, we do O(num WHATEVERs) remapping work. It's the O(num objects)
>> part that is really problematic.
>>
>> (I would say fundamental, but you *could* imagine something where you know
>> the old and new mapping, and try both, or incrementally walk through them..
>> but, man, it would be ugly. At that point you're probably better off just
>> migrating objects from pool A to pool B.)
>
> Actually, exactly that is a thing I've blue-skied with...somebody...as
> a more graceful way of dealing with cluster expansions. You could have
> CRUSH forks/epochs/whatever, and when making drastic changes to your
> cluster, create a new fork. (eg, we just tripled our storage and want
> all new data to be on the new servers, but to keep it in the same
> pool, and not go through a data migration right now.) Then any object
> access addresses every fork (or maybe you have metadata which knows
> which fork it was written in) looking for it.
> Then you could do things like incrementally merging forks (eg, fork 1
> for all objects > hash x, otherwise it has moved to fork 2) to bring
> them into unison.
>
> I'm not 100% certain this is better than just getting our pg
> replication easier to work well with, but it has some nice properties
> (it makes moving a single PG within the cluster easy and gives the
> admin a lot more control) and, unlike more manual movement (rbd
> snapshot+copyup), doesn't require intelligent clients, just an
> intelligent Objecter. On the downside, object fetches suddenly turn
> into 2 seek IOs instead of 1.

Interesting thread. Just wanted to say, as an institution with a couple
of fairly large clusters and a few years of operational experience under
our belts now, this particular comment of Greg's resonates. CRUSH is
awesome and elegant, but one of our biggest operational pain points is
managing map changes and their impacts. Our reality is that we impose
more map changes on the cluster through standard maintenance and
expansion/refresh activity than failures do (even at the ~1000-drive
scale), and we'd really like to have more flexibility around managing
those activities.

--
Cheers,
~Blairo
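
A minimal sketch of the OID => WHATEVER => DEVICE LIST split Sage describes
above. The names (oid_to_bucket, build_bucket_map, NUM_BUCKETS) are made up
for illustration and are not Ceph's real CRUSH interfaces; the point is only
that because the object -> bucket step is a fixed hash, a topology or policy
change rebuilds just the bucket -> devices table, O(num buckets) work rather
than O(num objects).

```python
import hashlib

NUM_BUCKETS = 128  # illustrative stand-in for a pool's pg_num


def oid_to_bucket(oid: str) -> int:
    """Fixed object -> bucket step: never changes when the cluster changes."""
    h = int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "big")
    return h % NUM_BUCKETS


def build_bucket_map(devices: list[str], replicas: int = 3) -> dict[int, list[str]]:
    """Policy-driven bucket -> device-list step (CRUSH's job in real life).

    Recomputed on every topology or policy change, but only over
    NUM_BUCKETS entries -- never per object."""
    return {
        b: [devices[(b + i) % len(devices)] for i in range(replicas)]
        for b in range(NUM_BUCKETS)
    }


def locate(oid: str, bucket_map: dict[int, list[str]]) -> list[str]:
    return bucket_map[oid_to_bucket(oid)]


# Tripling the cluster rebuilds only the 128-entry bucket map; which bucket
# an object hashes to is untouched.
old_map = build_bucket_map([f"osd.{i}" for i in range(10)])
new_map = build_bucket_map([f"osd.{i}" for i in range(30)])
print(locate("rbd_data.1234", old_map), "->", locate("rbd_data.1234", new_map))
```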
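And a rough sketch of Greg's blue-sky fork idea, again with hypothetical names
(Fork, ForkedPool, advance_merge) -- nothing like this exists in the Objecter
today. Writes target the newest fork, reads fall back through older forks
(the "2 seek IOs instead of 1" downside), and a hash boundary lets an old fork
be merged forward incrementally, as in "fork 1 for all objects > hash x,
otherwise it has moved to fork 2".

```python
import hashlib


def oid_hash(oid: str) -> int:
    # Illustrative stable hash; stands in for whatever the pool actually uses.
    return int.from_bytes(hashlib.sha1(oid.encode()).digest()[:4], "big")


class Fork:
    """One frozen placement epoch: its own bucket -> device-list table."""

    def __init__(self, fork_id: int, bucket_map: dict[int, list[str]]):
        self.fork_id = fork_id
        self.bucket_map = bucket_map

    def locate(self, oid: str) -> list[str]:
        return self.bucket_map[oid_hash(oid) % len(self.bucket_map)]


class ForkedPool:
    """Hypothetical client-side (Objecter) view of a pool with placement forks."""

    def __init__(self, forks: list[Fork]):
        self.forks = forks            # oldest first; new data goes to the newest
        self.merge_boundary = 0       # objects with hash < boundary are merged forward

    def write_target(self, oid: str) -> list[str]:
        # New writes always land under the newest fork's placement.
        return self.forks[-1].locate(oid)

    def read_candidates(self, oid: str) -> list[list[str]]:
        # Below the merge boundary the object is guaranteed to be in the newest
        # fork; otherwise every fork must be consulted, newest first -- the
        # extra lookup Greg mentions.
        if oid_hash(oid) < self.merge_boundary:
            return [self.forks[-1].locate(oid)]
        return [f.locate(oid) for f in reversed(self.forks)]

    def advance_merge(self, new_boundary: int) -> None:
        # Incremental merge: once objects with hash in [old boundary,
        # new_boundary) have been moved into the newest fork, bump the
        # boundary so readers stop checking the old forks for them.
        assert new_boundary >= self.merge_boundary
        self.merge_boundary = new_boundary


# e.g. after tripling the cluster: keep the old fork, add a new one for new data
old = Fork(1, {b: [f"osd.{(b + i) % 10}" for i in range(3)] for b in range(64)})
new = Fork(2, {b: [f"osd.{10 + (b + i) % 20}" for i in range(3)] for b in range(64)})
pool = ForkedPool([old, new])
print(pool.write_target("rbd_data.abcd"))
print(pool.read_candidates("rbd_data.abcd"))
```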