On 18/04/2016, Sage Weil wrote:
> If we're just talking about locality here, isn't this tradeoff
> already available to the user by controlling the object size?
> And/or setting the placement key so that two objects are always
> placed together in the cluster.

Object locator is a way to get locality. One worry I have with object
locator is that it's entirely client driven.

> The non-uniformity is a real problem, but let's not assume it's the
> same problem as flexible placement. If you want linear striping,
> for example, that gives you a perfect distribution but it falls
> apart as soon as there is any rebalancing.
>
> Are there some concrete examples of what these flexible placements
> might be? I still don't see how they are different than PGs. I.e.,
> you can split placement into two stages: mapping objects to PGs, and
> mapping PGs to devices. If both of those are pluggable, what does
> that not capture?

Since we seem to be using language slightly differently: I think you're
asking whether there are any concrete ideas I know of that don't fit
well into the OID => Hash => Device List model, and depending on how
you define 'concrete', I think there might be.

Part of what we'd like to investigate as part of the Blue Sky work is
instrumenting a cluster to provide usage data and feeding that into a
machine learning or annealing system to gradually optimize placement
for its workload. In that case you could have classifiers running over
object names, sorting them into categories with different placement
rules, each of which might use a different hashing strategy to sort
objects into OSDs. Data movement could be required by a change in
classification as well as by a change in hash or rule set. There IS a
hash in here, but trying to narrow it down to a single pipeline of
"HERE is the hash and HERE is the list of devices" might be a bad fit.

> (You could map every object to a distinct set of devices, and this
> is frequently suggested, but I don't think it is practical. In a
> large system, you'll have all-to-all replication streams, and
> *every* possible coincident failure will lead to data loss.)

So. In practice, many placements people will want to use will have the
pattern Object => Hash => Set of devices. Even systems that do
something weird, like fiddling with the hash to make it periodic in
some substring of the object name, would fit that paradigm.

I don't call the Hash a 'PG', because when I think 'PG' I think of the
/very specific/ combination of hash and stablemod and whatnot that we
happen to use in Ceph. I also think of the specific class in the OSD
that enforces an ordering of operations and a very specific sort of
replication behavior. So, when I say 'other than CRUSH/PG' I mean
'other than the CRUSH algorithm coupled with the PG hash'.

That said, I don't want to rule out allowing a direct Object -> Device
mapping. I would like an abstraction where, when we have a hash feeding
into a device list generator, the details need not be something the
client has to concern itself with. That seems to me like a good
abstraction. That way, when someone wants a specialized algorithm that
goes from OIDs to devices directly (even if it's only useful in very
specific contexts with very specific use cases), we could still write
it up, plug it into the system, and go to town.
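To make that concrete, here is a minimal sketch of the shape of
interface I have in mind. This is purely hypothetical -- none of these
classes or names exist in Ceph today, and a real version would need
epochs, pool ids, and so on. The point is only that the client-facing
contract is "object in, device list out"; whether a hash, a classifier,
or a direct mapping sits in the middle is a detail of the placement
module:

  // Hypothetical sketch only -- these classes and names do not exist
  // in Ceph today; they are made up for illustration.

  #include <cstdint>
  #include <string>
  #include <vector>

  using osd_id_t = int32_t;

  // Stage 1 (optional): collapse the object namespace into a smaller
  // key space.  A CRUSH/PG-style placement hashes here; a classifier-
  // driven or learned placement might bucket by workload category.
  struct ObjectMapper {
    virtual ~ObjectMapper() = default;
    virtual uint64_t map(const std::string& oid) const = 0;
  };

  // Stage 2: turn that key into a concrete device list.
  struct DeviceListGenerator {
    virtual ~DeviceListGenerator() = default;
    virtual std::vector<osd_id_t> devices(uint64_t key) const = 0;
  };

  // What the client actually programs against: object id in, device
  // list out.  A two-stage implementation composes the pieces above;
  // a specialized placement function could ignore them and map
  // objects to devices directly.
  struct PlacementFunction {
    virtual ~PlacementFunction() = default;
    virtual std::vector<osd_id_t> place(const std::string& oid) const = 0;
  };

  class TwoStagePlacement : public PlacementFunction {
  public:
    TwoStagePlacement(const ObjectMapper& m, const DeviceListGenerator& g)
      : mapper(m), gen(g) {}
    std::vector<osd_id_t> place(const std::string& oid) const override {
      return gen.devices(mapper.map(oid));
    }
  private:
    const ObjectMapper& mapper;
    const DeviceListGenerator& gen;
  };

The classifier idea above would just be another ObjectMapper
implementation (one that picks a category and rule set per object), and
PG-plus-CRUSH would be one particular TwoStagePlacement; neither needs
to leak into the client.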
> > - Transactional storage. As mentioned above, cross-object
> > transactional semantics are a thing people may have desired.
>
> This would just be a new pool type, right? We definitely need to
> clean up the OSD :: PG interface(s) to enable it.

I suspect for sanity you'd want to mark your entire pool as
Transactional. We'd still need the actual backend implementation to
support it.

> I was assuming that we'd just have multiple instances of the OSD
> class in the same process (as we used to back in the early days with
> fakesyn for testing). Is there any real difference here except
> naming?

I imagined a single OSD object holding several LogicalOSD objects
(mapped from OSD number); there's a rough sketch of the arrangement in
the postscript at the bottom of this mail. I'm not hell-bent on this,
but OSD and OSDService have an awful lot of stuff in them, and if we
could share most of it between the LogicalOSDs, it would be an
efficiency win.

> DataSetInterface still looks like PG (or whatever we name it), a
> minimal interface for passing data and control (peering) messages
> down.

That is correct. It's mostly PG's object-operation-executing functions.
The main thing is that I was surprised how much of PG looks to be used
by operations down the stack that could be hidden away from the OSD,
with the advantage of allowing more freedom to implement faster, less
lock-heavy data pipelines.

> This sounds like two basic points:
>
> 1) Let's clean up the PG interface already. Then we can add new
> pool types that aren't primary-driven and/or pglog-based. Class
> names will shift around as part of this.

That is most of what I was getting at.

> 2) Let's put many OSDs in the same process. There will be some
> similar naming changes.

That is also most of what I'm getting at. I'm not particularly picky
about names.

> What I don't see is where this means we should move away from PG, or
> an object -> PG -> device mapping strategy. Am I missing something
> key, or are we just using different words?

Partly, it's that we're using different names. I think when you use
'PG' you mean any hash-like algorithm that cuts the space of objects
down to something that can be dealt with more tractably. I use it to
mean the very specific algorithm we happen to use right now.

Partly, I don't want to make the use of a hash like that compulsory. I
wouldn't mind placement functions being able to surface it when there
are useful optimizations that can be made that way. However, part of
the benefit of the NaCl-based Flexible Placement design is that
administrators can put in all sorts of weird things we've never thought
of. If we are to redesign some of our abstractions to support it and
other goals we want (which I think we must), it seems a shame not to
make them as flexible as we reasonably can. After all, we are building
the future of storage, and the Future is full of a lot of things we
haven't thought of.

--
Senior Software Engineer
Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC, Freenode}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
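P.S. The sketch I promised above, of how I imagine the OSD / LogicalOSD
/ DataSetInterface split. Again, LogicalOSD and DataSetInterface are
the names used in this thread, not existing classes, and the members
shown are placeholders:

  // Rough sketch only; everything here is illustrative, not existing
  // Ceph code.

  #include <cstdint>
  #include <map>
  #include <memory>

  struct OpRequest {};      // stand-in for a client operation
  struct PeeringEvent {};   // stand-in for a control/peering message

  // Minimal surface the OSD needs from a data set: hand it operations
  // and peering/control messages and let it decide how to execute them.
  struct DataSetInterface {
    virtual ~DataSetInterface() = default;
    virtual void queue_op(OpRequest op) = 0;
    virtual void handle_peering(PeeringEvent evt) = 0;
  };

  // One logical OSD: its own id and its own data sets, but no
  // messengers, heartbeats, or config machinery of its own.
  struct LogicalOSD {
    explicit LogicalOSD(int id) : whoami(id) {}
    int whoami;
    std::map<uint64_t, std::unique_ptr<DataSetInterface>> data_sets;
  };

  // The per-process object: owns the expensive shared services (much
  // of what lives in OSD/OSDService today) once, plus a table of
  // logical OSDs keyed by OSD number.
  struct OSDHost {
    struct SharedServices {};  // messengers, threadpools, monc, etc.
    SharedServices shared;
    std::map<int, std::unique_ptr<LogicalOSD>> logical_osds;
  };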