On Mon, Apr 18, 2016 at 11:57 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 18 Apr 2016, Adam C. Emerson wrote:
>> > I think that in those cases, we let them use a wacky object -> PG
>> > mapping, and then have a linear/flat PG -> device mapping so they
>> > can still get some uniformity.
>> >
>> > This isn't completely general, but I'd want to see an example of
>> > something it can't express. Maybe those entanglement erasure codes
>> > that Veronica was talking about at FAST?
>> >
>> > Or maybe the key step is to not assume the PG is a hash range, but
>> > instead consider it an arbitrary bucket in the pool.
>>
>> Would the idea basically be in that case that we go from OID =>
>> WHATEVER => DEVICE LIST without putting too many constraints on what
>> 'WHATEVER' is?
>
> Yeah, probably with the additional/implicit restriction that the OID ->
> WHATEVER mapping is fixed. That way, when a policy or topology change
> happens, we do O(num WHATEVERs) remapping work. It's the O(num objects)
> part that is really problematic.
>
> (I would say fundamental, but you *could* imagine something where you
> know the old and new mapping, and try both, or incrementally walk
> through them... but, man, it would be ugly. At that point you're
> probably better off just migrating objects from pool A to pool B.)

Actually, exactly that is a thing I've blue-skied with...somebody...as a
more graceful way of dealing with cluster expansions. You could have
CRUSH forks/epochs/whatever, and when making drastic changes to your
cluster, create a new fork. (E.g., we just tripled our storage and want
all new data to be on the new servers, but to keep it in the same pool
and not go through a data migration right now.) Then any object access
addresses every fork (or maybe you have metadata which knows which fork
it was written in) looking for it. Then you could do things like
incrementally merging forks (e.g., fork 1 for all objects > hash x;
otherwise it has moved to fork 2) to bring them into unison.

I'm not 100% certain this is better than just getting our PG replication
to work well, but it has some nice properties (it makes moving a single
PG within the cluster easy and gives the admin a lot more control) and,
unlike more manual movement (rbd snapshot+copyup), it doesn't require
intelligent clients, just an intelligent Objecter. On the downside,
object fetches suddenly turn into 2 seek IOs instead of 1.
-Greg
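
A minimal sketch of that multi-fork lookup, in C++, for concreteness.
Everything here is hypothetical (Fork, lookup, and exists_on are made-up
names, not real Ceph or Objecter APIs); it only illustrates the
try-every-fork-newest-first idea, assuming each fork's placement is a
frozen pure function of the object id:

  // All names here are hypothetical; this is a sketch, not Ceph code.
  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <map>
  #include <optional>
  #include <string>
  #include <vector>

  // One "fork": a frozen placement function from object id to device list.
  struct Fork {
    uint32_t epoch;  // fork creation epoch; newer forks have higher epochs
    std::function<std::vector<int>(const std::string&)> map_oid;
  };

  // Try forks newest-first and return the placement of the first fork
  // that actually holds the object. 'exists_on' stands in for the
  // stat/read the Objecter would issue; each miss costs another round
  // trip, which is the "2 seek IOs instead of 1" downside above.
  std::optional<std::vector<int>> lookup(
      const std::string& oid, const std::vector<Fork>& forks,
      const std::function<bool(const std::string&,
                               const std::vector<int>&)>& exists_on) {
    for (const auto& f : forks) {  // forks ordered newest-first
      auto devs = f.map_oid(oid);
      if (exists_on(oid, devs))
        return devs;
    }
    return std::nullopt;           // not present under any fork
  }

  int main() {
    std::hash<std::string> h;
    // Fork 2: mapping after tripling the cluster; fork 1: the old mapping.
    std::vector<Fork> forks = {
      {2, [&](const std::string& o) {
            return std::vector<int>{int(h(o) % 30)}; }},
      {1, [&](const std::string& o) {
            return std::vector<int>{int(h(o) % 10)}; }},
    };

    // Toy backend: this object was written before the expansion, so it
    // lives wherever fork 1 placed it.
    std::map<std::string, int> stored;
    stored["rbd_data.1234"] = forks[1].map_oid("rbd_data.1234")[0];
    auto exists_on = [&](const std::string& o, const std::vector<int>& devs) {
      auto it = stored.find(o);
      return it != stored.end() && it->second == devs[0];
    };

    if (auto devs = lookup("rbd_data.1234", forks, exists_on))
      std::cout << "found on device " << (*devs)[0] << "\n";
  }

Trying forks newest-first means fresh writes (which land under the
newest fork) pay a single probe, and only older objects fall through to
extra probes. Incremental merging would then just shrink the set of oids
for which the fork-1 probe can still hit.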