Re: OSDMap partitioning

On Mon, 18 Apr 2016, Matt Benjamin wrote:
> ----- Original Message -----
> 
> > > Second, I've always been working under the assumption that placement
> > > is a function of workload as well as hardware. At least there's a lot
> > > of interesting space in the 'placement function choice'/'workload
> > > optimization' intersection.
> 
> A lot of the CohortFS work after incorporating Ceph was indeed about 
> adapting Ceph abstractions to provide first-class support for workload 
> and tenant isolation; it is difficult for me to imagine not needing this 
> in a system that addresses the problems Ceph does, at the intended 
> scale.
> 
> > 
> > Hmm, this is true.  I've been assuming the workload-informed placement
> > would be a tier, not something within a pool.  The fundamental rados
> > property is that the map is enough to find your data... by its *name*.
> > The moment the placement depends on who wrote 'foo' (and not the name of
> > 'foo') that doesn't work.
> 
> Tiers are great, but they represent another compositional primitive, not
> an alternative to an ability to fundamentally construct the data/server
> aggregate 

I keep getting stuck on something here, I think, that is keeping me from 
following the logic.  Maybe someone else can tell me what it is?

I think I understand what you mean by 'fundamentally construct the 
data/server aggregate'.  Maybe you want distributed, replicated, globally 
shared foo.  Maybe you want low-latency immutable bar (a new pool type).  
But maybe you want more direct layout control over these 16 nvme cards 
over here, and the ability to define a tenant policy that lets me use them.  
Part of me thinks that if you want direct control over layout, don't use 
Ceph--just access those cards directly (stripe with dm or something).  But 
maybe you do want global/shared access.  In that case, you want Ceph 
involved.  If you want isolation but shared access, pools are fine--pools 
describe hardware, and there are only O(size of cluster) of them.  If you 
are O(tenants), you can't have hardware-level isolation, and namespaces 
are what work.
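
To make that concrete, here is a toy sketch (plain Python, with a 
stand-in hash in place of the real CRUSH code--none of this is the actual 
code path, and the map layout is invented for illustration) of the 
property I keep leaning on: placement is a pure function of the map, the 
pool, and the object name, so a namespace can only change *which object* 
you address inside a pool, while only the pool choice can change *which 
hardware* you land on.

    import hashlib

    def _h(data: bytes) -> int:
        # Small helper: a stable pseudo-random integer from bytes.
        return int(hashlib.sha1(data).hexdigest(), 16)

    def place(osdmap: dict, pool: str, name: str, namespace: str = "") -> list:
        """Deterministic placement from the map plus the object's name alone."""
        p = osdmap["pools"][pool]
        # The namespace is folded into the object key: it selects which object
        # you address inside the pool, not which OSDs you hit.
        pg = _h(f"{namespace}\0{name}".encode()) % p["pg_num"]
        # Stand-in for CRUSH: pick p["size"] distinct OSDs from the set the
        # pool's rule is allowed to use, pseudo-randomly but repeatably.
        ranked = sorted(p["osds"], key=lambda osd: _h(f"{pg}.{osd}".encode()))
        return ranked[:p["size"]]

    # Two pools can point at disjoint hardware (hardware-level isolation;
    # there are O(size of cluster) of these).  Namespaces share their
    # pool's hardware (O(tenants) of these).
    osdmap = {
        "pools": {
            "hdd_pool":  {"pg_num": 128, "size": 3, "osds": list(range(0, 100))},
            "nvme_pool": {"pg_num": 32,  "size": 2, "osds": list(range(100, 116))},
        }
    }
    print(place(osdmap, "nvme_pool", "foo", namespace="tenant-a"))
    print(place(osdmap, "nvme_pool", "foo", namespace="tenant-b"))

The moment place() has to consult who wrote 'foo' (or any per-tenant 
state that isn't in the map), every client needs that state too, and the 
'map is enough' property is gone.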

Maybe my hang-up is that you're thinking about things that aren't shared, 
or aren't redundant (e.g., this locally attached nvme card on a client 
node)?  Or maybe we're missing a portable abstraction for a local-y thing.  
Like, I am a user on host A and want local-dataset-A on this local nvme.  
But if I move, I want to seamlessly migrate that dataset to my new host B.  
That we can't do, because it means *global* naming/indirection for a 
per-tenant thing.  If we have a small enough number of tenants that we can 
use a pool per tenant, all is well.  But if we want many tenants to be 
able to do this, namespaces in their current form aren't sufficient.

> In addition to workload, there is isolation.  While geopolitical/regulatory
> scale segregation is important in cloud, more fine-grained isolation of
> all different kinds is important for policy control within data centers.

Again, it *seems* like the isolation you're talking about is 
hardware-level, which is O(size of cluster), and can be addressed by 
pools.  Beyond that, we also want *virtual isolation* (i.e., QoS), which 
we'll be tackling with something like dmclock.
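
For reference, the shape of that is per-client (reservation, weight, 
limit) tags attached to each request; a heavily simplified sketch of the 
idea (not the actual dmclock library or its API, and the numbers are made 
up) looks like:

    import time
    from typing import Optional

    class MClockClient:
        """Toy mClock-style tagging: (reservation, weight, limit) per client.

        Not the real dmclock implementation--just the idea that QoS
        ('virtual isolation') is per-client bookkeeping in the scheduler,
        not a carving-up of hardware.
        """

        def __init__(self, reservation: float, weight: float, limit: float):
            self.r, self.w, self.l = reservation, weight, limit
            self.r_tag = self.p_tag = self.l_tag = 0.0

        def tag_request(self, now: Optional[float] = None) -> tuple:
            now = time.monotonic() if now is None else now
            # Each tag advances by 1/rate per request, but never falls
            # behind the clock, so idle clients don't bank credit.
            self.r_tag = max(self.r_tag + 1.0 / self.r, now)  # reservation (floor)
            self.p_tag = max(self.p_tag + 1.0 / self.w, now)  # proportional share
            self.l_tag = max(self.l_tag + 1.0 / self.l, now)  # limit (ceiling)
            return (self.r_tag, self.p_tag, self.l_tag)

    # e.g. a tenant with a 100-op/s floor, 2x share, 1000-op/s cap
    tenant_a = MClockClient(reservation=100.0, weight=2.0, limit=1000.0)
    print(tenant_a.tag_request())

The scheduler serves overdue reservation tags first, then orders the rest 
by the proportional tag among clients still under their limit--so 
tenant-level isolation stays O(tenants) worth of bookkeeping rather than 
O(tenants) pools.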

Maybe a concrete example of how one might 'fundamentally construct a 
data/server aggregate' would be helpful?

Not trying to be difficult, just trying to understand the what and why.

Thanks!
sage


