Re: Limitations of Ceph

On Tue, 27 Aug 2013, Guido Winkelmann wrote:
> Hi Sage,
> 
> Thanks for your comments, much appreciated.
> 
> On Tuesday, 27 August 2013, 10:19:46, Sage Weil wrote:
> > Hi Guido!
> > 
> > On Tue, 27 Aug 2013, Guido Winkelmann wrote:
> [...]
> > > - There is no dynamic tiered storage, and there probably never will be, if
> > > I understand the architecture correctly.
> > > You can have different pools with different performance characteristics
> > > (like one on cheap and large 7200 RPM disks, and another on SSDs), but
> > > once you have put a given bunch of data on one pool, it is pretty much
> > > stuck there. (I.e. you cannot move it to another pool without very tight
> > > and very manual coordination with all clients using it.)
> > 
> > This is a key item on the roadmap for Emperor (nov) and Firefly (feb).
> > We are building two capabilities: 'cache pools' that let you put fast
> > storage in front of your main data pool, and a tiered 'cold' pool that
> > lets you bleed cold objects off to a cheaper, slower tier
> 
> Sounds interesting.
> Will that work on entire PGs or on single objects? How do you keep track 
> of which object lives in which pool without resorting to a lookup step 
> before every operation? Will that feature retain backwards compatibility 
> with older Ceph clients?

In both cases it works at object granularity.  PG granularity does not make 
sense because objects are randomly distributed across PGs.

For the cache pool, the clients will check the cache then the main pool.  
The cache pool should be built with devices that are fast and low-latency 
(i.e. flash).

For the cold pool, there is a 'redirect' in the main pool that sends reads 
to the cold tier.  Some as-yet unspecified policy will control when 
the cold object is brought back into the main pool (e.g., on any write, 
after multiple reads, etc.).

> > (probably using erasure coding.. which is also coming in firefly).
> 
> ... which happens to address another issue I forgot to mention
> 
> > > - There is no active data deduplication, and, again, if I understand the
> > > architecture correctly, there probably never will be.
> > > There is, however, sparse allocation and COW-cloning for RBD volumes,
> > > which does something similar. Under certain conditions, it is even
> > > possible to use the discard option of modern filesystems to automatically
> > > keep unused regions of an RBD volume sparse.
> > 
> > You can do two things:
> > 
> > - Do dedup inside an osd.  Btrfs is growing this capability, and ZFS
> > already has it.  This is not ideal because data is randomly distributed
> > across nodes.
> > 
> > - You can build dedup on top of rados, for example by naming objects after
> > a hash of their content.  This will never be a 'magic and transparent
> > dedup for all rados apps' because CAS is based on naming objects from
> > content, and rados fundamentally places data based on name and eschews
> > metadata.  That means there isn't normally a way to point to the content
> > unless there is some MDS on top of rados.  Someday CephFS will get this,
> > but raw librados users and RBD won't get it for free.
> 
> I read that as TL;DR: No real deduplication.
>  
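
Just to illustrate the second option above: naming objects after a hash of 
their content boils down to something like this with the rados CLI (the 
pool name is made up, and a real system would still need reference 
counting and some kind of index layered on top):

   # content-addressed write: the object name is a hash of the data
   obj=$(sha256sum ./disk-image.raw | awk '{print $1}')
   rados -p cas-pool put "$obj" ./disk-image.raw

Identical data then maps to the same object name, so writing the same 
content twice does not consume additional space.
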
> > > - Bad support for multiple customers accessing the same cluster.
> > > This is assuming that, if you have multiple customers, it is imperative
> > > that any one given customer must be unable to access or even modify the
> > > data of any other customer. You can have authorization on the pool layer,
> > > but it has been reported that Ceph reacts badly to defining a large
> > > number of pools. Multi-customer support in CephFS is non-existent.
> > > RadosGW probably supports multi-customer, but I haven't tried it.
> > 
> > The just-released Dumpling included support for rados namespaces, which
> > are designed to address exactly this issue.  Namespaces exist "inside"
> > pools, and the auth capabilities can restrict access to a specific
> > namespace.
> 
> I'm having some trouble finding this in the documentation. Can you give me a 
> pointer here?

It's a new addition to the librados API:

https://github.com/ceph/ceph/blob/master/src/include/rados/librados.hpp#L664

but is not well documented.  There is a blueprint for adding support in 
RBD, which is how most users will probably consume it.

 http://wiki.ceph.com/01Planning/02Blueprints/Emperor/rbd%3A_namespace_support
 http://pad.ceph.com/p/rbd-namespaces
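
As a rough example of how the cap scoping is meant to work (the names here 
are made up, and the exact cap syntax may vary by release), each tenant 
gets a key that is limited to one namespace inside a shared pool:

   # hypothetical per-customer key, restricted to its own namespace
   ceph auth get-or-create client.customer1 \
       mon 'allow r' \
       osd 'allow rw pool=data namespace=customer1'

On the client side, a librados application then selects that namespace on 
its IoCtx (the new set_namespace() call) before it reads or writes.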

> > > - No dynamic partitioning for CephFS
> > > The original paper talked about dynamic partitioning of the CephFS
> > > namespace, so that multiple Metadata Servers could share the workload of
> > > a large number of CephFS clients. This isn't implemented yet (or
> > > implemented but not working properly?), and the only currently supported
> > > multi-MDS configuration is 1 active / n standby. This limits the
> > > scalability of CephFS. It looks to me like CephFS is not a major focus of
> > > the development team at this time.
> > 
> > This has been implemented since ~2006.  We do not recommend it for
> > production because it has not had the QA attention it deserves.  That
> > said, Zheng Yan has been doing a lot of great work here recently and
> > things have improved considerably.  Please try it!  You just need to do
> > 'ceph mds set_max_mds 3' (or whatever) to tell ceph how many active
> > ceph-mds daemons you want.
> 
> Okay, I think I will try this.
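
One practical note: you obviously need that many ceph-mds daemons running 
in the first place, and you can watch them switch from standby to active 
with something like

   ceph mds set_max_mds 3
   ceph mds stat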

Let us know how it goes!

sage

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



