Re: Limitations of Ceph

Hi Guido!

On Tue, 27 Aug 2013, Guido Winkelmann wrote:
> Hi,
> 
> I have been running a small Ceph cluster for experimentation for a while, and 
> now my employer has asked me to do a little talk about my findings, and one 
> important part is, of course, going to be practical limitations of Ceph.
> 
> Here is my list so far:
> 
> - Ceph is not supported by VMWare ESX. That may change in the future, but 
> seeing how VMWare is now owned by EMC, they might make it a political decision 
> to not support Ceph.
> Apparently, you can import an RBD volume on linux server and then reexport it 
> to a VMWare host as an iSCSI target, but doing so would introduce a bottleneck 
> and a single point of failure, which kind of defeats the purpose of having a 
> Ceph cluster in the first place.

It will be a challenge to get native RBD support into ESX: RBD is open 
source while ESX is proprietary, ESX is (I think) based on a *BSD kernel, 
and VMware just announced a possibly competitive product.  Inktank is 
doing what it can.

Meanwhile, we are pursuing a robust iSCSI solution. Sadly this will 
require a traditional HA failover setup, but that's how the cookie 
crumbles when you use legacy protocols.

> - Ceph is not supported by Windows clients, or even, as far as I can tell, 
> anything that isn't a very recent version of Linux. (User space only clients 
> work in some cases.)

There is ongoing work here; nothing to announce yet.

> - There is no dynamic tiered storage, and there probably never will be, if I 
> understand the architecture correctly.
> You can have different pools with different performance characteristics (like 
> one on cheap and large 7200 RPM disks, and another on SSDs), but once you have 
> put a given bunch of data on one pool, it is pretty much stuck there. (I.e. 
> you cannot move it to another pool without very tight and very manual 
> coordination with all clients using it.)

This is a key item on the roadmap for Emperor (Nov) and Firefly (Feb).  
We are building two capabilities: 'cache pools' that let you put fast 
storage in front of your main data pool, and a tiered 'cold' pool that 
lets you bleed cold objects off to a cheaper, slower tier (probably using 
erasure coding, which is also coming in Firefly).
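
The interface isn't final yet, but the cache pool workflow we have in 
mind looks roughly like this (command names are tentative and may change 
before Firefly ships):

  # create a fast (e.g. SSD-backed) pool and put it in front of an
  # existing data pool
  ceph osd pool create cache-pool 128
  ceph osd tier add data cache-pool
  ceph osd tier cache-mode cache-pool writeback
  ceph osd tier set-overlay data cache-pool   # clients hit the cache transparently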

> - There is no active data deduplication, and, again, if I understand the 
> architecture correctly, there probably never will be.
> There is, however, sparse allocation and COW-cloning for RBD volumes, which 
> does something similar. Under certain conditions, it is even possible to use 
> the discard option of modern filesystems to automatically keep unused regions 
> of an RBD volume sparse.

You can do two things:

- Do dedup inside an osd.  Btrfs is growing this capability, and ZFS 
already has it.  This is not ideal because data is randomly distributed 
across nodes.

- You can build dedup on top of rados, for example by naming objects after 
a hash of their content.  This will never be a 'magic and transparent 
dedup for all rados apps' because CAS is based on naming objects from 
content, and rados fundamentally places data based on name and eschews 
metadata.  That means there isn't normally a way to point to the content 
unless there is some MDS on top of rados.  Someday CephFS will get this, 
but raw librados users and RBD won't get it for free.
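
As a toy illustration of the second approach, a client can hash each 
chunk and use the hash as the object name (the 'cas' pool is made up, and 
the application still has to keep its own index mapping logical names to 
hashes):

  # store a chunk under the SHA-256 of its content; identical chunks collapse
  H=$(sha256sum chunk.bin | awk '{print $1}')
  rados -p cas put "$H" chunk.bin

  # later: fetch it back -- but only if you recorded $H somewhere
  rados -p cas get "$H" chunk-copy.bin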

> - Bad support for multiple customers accessing the same cluster.
> This is assuming that, if you have multiple customers, it is imperative that 
> any one given customer must be unable to access or even modify the data of any 
> other customer. You can have authorization on the pool layer, but it has been 
> reported that Ceph reacts badly to defining a large number of pools.
> Multi-customer support in CephFS is non-existent.
> RadosGW probably supports multi-customer, but I haven't tried it.

The just-released Dumpling includes support for rados namespaces, which 
are designed to address exactly this issue.  Namespaces exist "inside" 
pools, and the auth capabilities can restrict a client to a specific 
namespace.
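
For example, a per-customer key can be restricted to its own namespace 
with something along these lines (the pool and namespace names are just 
examples):

  # key for customer1, limited to the 'customer1' namespace in a shared pool
  ceph auth get-or-create client.customer1 \
      mon 'allow r' \
      osd 'allow rw pool=rbd namespace=customer1'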

> - No dynamic partitioning for CephFS
> The original paper talked about dynamic partitioning of the CephFS namespace, so 
> that multiple Metadata Servers could share the workload of a large number of 
> CephFS clients. This isn't implemented yet (or implemented but not working 
> properly?), and the only currently supported multi-MDS configuration is 1 active 
> / n standby. This limits the scalability of CephFS. It looks to me like CephFS 
> is not a major focus of the development team at this time.

This has been implemented since ~2006.  We do not recommend it for 
production because it has not had the QA attention it deserves.  That 
said, Zheng Yan has been doing a lot of great work here recently and 
things have improved considerably.  Please try it!  You just need to do 
'ceph mds set_max_mds 3' (or whatever) to tell ceph how many active 
ceph-mds daemons you want.
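
For example, to go from the default single active MDS to three (assuming 
you already have at least three ceph-mds daemons running):

  ceph mds set_max_mds 3
  ceph mds stat   # should eventually show 3 up:active, the rest up:standby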

Hope that helps!

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com