Hi Sage,

Thanks for your comments, much appreciated.

On Tuesday, 27 August 2013, at 10:19:46, Sage Weil wrote:
> Hi Guido!
>
> On Tue, 27 Aug 2013, Guido Winkelmann wrote:
[...]
> > - There is no dynamic tiered storage, and there probably never will
> > be, if I understand the architecture correctly.
> > You can have different pools with different performance
> > characteristics (like one on cheap and large 7200 RPM disks, and
> > another on SSDs), but once you have put a given bunch of data on one
> > pool, it is pretty much stuck there. (I.e. you cannot move it to
> > another pool without very tight and very manual coordination with
> > all clients using it.)
>
> This is a key item on the roadmap for Emperor (nov) and Firefly (feb).
> We are building two capabilities: 'cache pools' that let you put fast
> storage in front of your main data pool, and a tiered 'cold' pool that
> lets you bleed cold objects off to a cheaper, slower tier

Sounds interesting. Will that work on entire PGs or on single objects?
How do you keep track of which object lies in which pool without
resorting to a lookup step before every operation? Will that feature
retain backwards compatibility with older Ceph clients?

> (probably using erasure coding.. which is also coming in firefly).

... which happens to address another issue I forgot to mention.

> > - There is no active data deduplication, and, again, if I understand
> > the architecture correctly, there probably never will be.
> > There is, however, sparse allocation and COW-cloning for RBD volumes,
> > which does something similar. Under certain conditions, it is even
> > possible to use the discard option of modern filesystems to
> > automatically keep unused regions of an RBD volume sparse.
>
> You can do two things:
>
> - Do dedup inside an osd. Btrfs is growing this capability, and ZFS
> already has it. This is not ideal because data is randomly distributed
> across nodes.
>
> - You can build dedup on top of rados, for example by naming objects
> after a hash of their content. This will never be a 'magic and
> transparent dedup for all rados apps' because CAS is based on naming
> objects from content, and rados fundamentally places data based on
> name and eschews metadata. That means there isn't normally a way to
> point to the content unless there is some MDS on top of rados. Someday
> CephFS will get this, but raw librados users and RBD won't get it for
> free.

I read that as TL;DR: no real deduplication, unless an application
builds its own on top of rados. (I have tried to sketch what I think
that would look like further down.)

> > - Bad support for multiple customers accessing the same cluster.
> > This is assuming that, if you have multiple customers, it is
> > imperative that any one given customer must be unable to access or
> > even modify the data of any other customer. You can have
> > authorization on the pool layer, but it has been reported that Ceph
> > reacts badly to defining a large number of pools. Multi-customer
> > support in CephFS is non-existent. RadosGW probably supports
> > multi-customer, but I haven't tried it.
>
> The just-released Dumpling included support for rados namespaces,
> which are designed to address exactly this issue. Namespaces exist
> "inside" pools, and the auth capabilities can restrict access to a
> specific namespace.

I'm having some trouble finding this in the documentation. Can you give
me a pointer here?
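Coming back to the deduplication point for a second, just to check that
I am reading the CAS idea correctly: below is a rough, untested sketch
of what "naming objects after a hash of their content" could look like
through the python-rados bindings. The pool name, the object names and
the helper functions are all made up for illustration, and the little
"index" object at the end stands in for exactly the metadata layer
that, as you say, rados itself will not provide.

import hashlib
import rados

# Connect with the usual ceph.conf and default client credentials.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')   # pool name is just an example

def cas_put(data):
    # Name the object after the SHA-256 of its content.  Storing the
    # same content twice simply rewrites the one identical object.
    name = hashlib.sha256(data).hexdigest()
    ioctx.write_full(name, data)
    return name

def cas_get(name):
    size = ioctx.stat(name)[0]       # stat() returns (size, mtime)
    return ioctx.read(name, length=size)

# The catch: something has to remember which hash belongs to which
# logical name.  This tiny "index" object stands in for that extra
# metadata layer.
ref = cas_put(b'some chunk of file data')
ioctx.write_full('index:my-file:0', ref.encode())

ioctx.close()
cluster.shutdown()

If that is about right, then every librados application would have to
maintain an index like that itself, which matches your point about
needing some MDS on top of rados.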
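And on the namespaces: this is how I picture using them from librados,
assuming the python-rados bindings expose set_namespace() for this. The
client id, pool and namespace names below are invented, and the cap
syntax in the comment is only my reading of your description.

import rados

# Connect as a per-customer client; the id and pool are only examples.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      rados_id='customer-a')
cluster.connect()
ioctx = cluster.open_ioctx('shared-pool')

# Everything written through this ioctx now lands in the 'customer-a'
# namespace inside the pool, separate from objects of the same name in
# other namespaces.
ioctx.set_namespace('customer-a')
ioctx.write_full('greeting', b'hello from customer A')

# On the auth side, I would expect the customer's key to carry a cap
# restricted to that namespace, something like
#   osd 'allow rw pool=shared-pool namespace=customer-a'
# (going by your description; I have not verified the exact syntax).

ioctx.close()
cluster.shutdown()

If that is roughly how it is meant to work, it would cover the
multi-customer case without needing one pool per customer.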
> > - No dynamic partitioning for CephFS
> > The original paper talked about dynamic partitioning of the CephFS
> > namespace, so that multiple Metadata Servers could share the
> > workload of a large number of CephFS clients. This isn't implemented
> > yet (or implemented but not working properly?), and the only
> > currently supported multi-MDS configuration is 1 active / n standby.
> > This limits the scalability of CephFS. It looks to me like CephFS is
> > not a major focus of the development team at this time.
>
> This has been implemented since ~2006. We do not recommend it for
> production because it has not had the QA attention it deserves. That
> said, Zheng Yan has been doing a lot of great work here recently and
> things have improved considerably. Please try it! You just need to do
> 'ceph mds set_max_mds 3' (or whatever) to tell ceph how many active
> ceph-mds daemons you want.

Okay, I think I will try this.

	Guido

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com