I actually opted not to mention the specific product we had problems with, since there have been lots of changes and fixes to it which we unfortunately were unable to make use of (you'll see why later). But I guess it's interesting enough to go into a little more detail, so: before moving to Ceph we were using the Riak distributed database from Basho - http://riak.basho.com.

First I have to say that Riak is actually pretty awesome in many ways, not least operations-wise. Compared to Ceph it's a lot easier to get up and running and to add storage as you go... basically just one command to add a node to the cluster, and all you need is the address of any existing node. With Riak every node is the same, so there is no SPOF by default (no MDS, no MON - just nodes).

As you might have thought already, "distributed database" isn't exactly the same as "distributed storage", so why did we use it? Well, there is an add-on to Riak called Luwak, also created and supported by Basho, that was touted as "large object support": you could store objects as large as you wanted. Our main problems were with this add-on (which, as I said, was created and supported by Basho). An object in "standard" Riak K/V is limited to, I think, around 40 MB - or at least you shouldn't store larger objects than that because it means trouble. So we went with Luwak, which seemed to be a perfect solution for the kind of storage we do.

We ran with Luwak for almost two years and it usually served us pretty well. Unfortunately there were bugs and hidden problems which, IMO, Basho should have been more open about. One issue is that Riak relies on a repair mechanism called "read-repair" - the name pretty much tells you how it works: data is only repaired when it is read. That is a problem in itself when you archive data, which we do (i.e. data that is read rarely or not at all). With Luwak (the large-object add-on), data is split into many keys and values and stored in the "normal" Riak K/V store - I'll sketch roughly what I mean below. Unfortunately read-repair didn't seem to work at all in that scenario, and if something was missing Riak had a tendency to crash HARD, sometimes managing to take the whole machine down with it. There were also strange issues where one crashing node seemed to affect its neighbors so that they crashed too... a domino effect which makes "distributed" a little too distributed. This didn't always happen, but it did happen several times in our case. The logs were often pretty hard to understand and more often than not left us completely in the dark about what was going on.

We also discovered that deleting data in Luwak doesn't actually DO anything... sure, the key is gone, but the data is still on disk, seemingly orphaned, so deleting was more or less a no-op. This was nowhere to be found in the docs. Finally, on the 3rd of June this year I think, we requested paid support from Basho to help us in our last crash-and-burn situation, and that's when we were told, among other things, that DELETE only appears to work. We were also told that Luwak was originally created to store email and not really the kinds of things we store (i.e. files). This information wasn't available anywhere - Luwak simply had the wrong "table of contents" associated with it. All this was quite a turn-off for us. To Basho's credit they really did help us fix our cluster, and they're really nice, friendly and helpful guys.
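To make the read-repair problem a bit more concrete, here is roughly how I understand a Luwak-style layer to work. This is only my own sketch of the idea in Python, not Basho's actual implementation, and every name in it is made up: a large object is cut into fixed-size blocks, each block is written to the ordinary K/V store under its own key, and a small manifest lists the block keys.

  # Illustrative sketch only - a dict stands in for the distributed K/V store.
  import hashlib

  CHUNK_SIZE = 1024 * 1024   # 1 MB blocks; the size is just an example

  kv_store = {}              # pretend this is the Riak K/V store

  def put_large_object(name, data):
      """Split 'data' into blocks, store each block under its own key,
      then store a manifest listing the block keys."""
      block_keys = []
      for i in range(0, len(data), CHUNK_SIZE):
          block = data[i:i + CHUNK_SIZE]
          key = "block:" + hashlib.sha1(block).hexdigest()
          kv_store[key] = block            # one ordinary K/V write per block
          block_keys.append(key)
      kv_store["manifest:" + name] = block_keys

  def get_large_object(name):
      """Reassemble the object by reading every block.
      In a real cluster these reads are what would trigger read-repair,
      so blocks that are never read never get repaired."""
      return b"".join(kv_store[k] for k in kv_store["manifest:" + name])

The point is simply that one "file" ends up as a lot of separate keys, and each of those keys only gets a chance at read-repair when somebody actually reads it - which, for archived data, is more or less never.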
Actually, I think the last straw was when Luwak was suddenly - out of nowhere, really - discontinued around the beginning of this year, probably because of those bugs and hidden problems, which I suspect go back to a less than stellar implementation of large-object support from the start. By then we were running on something completely unsupported. We couldn't switch to something else immediately, of course, but that's when we started looking around.

That's when I found Ceph, among other more or less distributed systems. The others were:

  Tahoe-LAFS: https://tahoe-lafs.org/trac/tahoe-lafs
  XtreemFS: http://www.xtreemfs.org
  HDFS: http://hadoop.apache.org/hdfs/
  GlusterFS: http://www.gluster.org
  PomegranateFS: https://github.com/macan/Pomegranate/wiki
  MooseFS: http://www.moosefs.org
  OpenStack Swift: http://docs.openstack.org/developer/swift/
  MongoDB GridFS: http://www.mongodb.org/display/DOCS/GridFS
  LS4: http://ls4.sourceforge.net/

After trying most of these I decided to look closer at a few of them: MooseFS, HDFS, XtreemFS and Ceph. The others were either not really suited to our use case or just too complicated to set up and keep running (IMO). For a short while I dabbled in writing my own storage system using ZeroMQ for communication, but that's just not what our company does, so I gave that up pretty quickly :-). In the end I chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally, but it was better in every other aspect and definitely a good fit. The RADOS Gateway (S3-compatible) was really a big thing for us as well.

As I started out saying, there have been many improvements to Riak, not least to the large-object support... but that large-object support is no longer built on Luwak. It's a completely new thing, and it's not open source or free. It's called Riak CS (Cloud Storage), it has an S3-compatible interface, and it seems to be pretty good. We had many internal discussions about whether Riak CS was the right move for us, but in the end we decided on Ceph since we couldn't justify the cost of Riak CS.

To sum it up: in retrospect we made a bad choice - not because Riak itself doesn't work or isn't good at the things it's good at (it really is!) but because the Luwak add-on was misrepresented and wasn't a good fit for us. I really have high hopes for Ceph and I think it has a bright future, both in our company and in general. Riak CS would probably have been a very good fit as well if it weren't for the cost involved.

So there you have it - not just failure scenarios, but bad decisions, misrepresentation of features and somewhat sparse documentation. By the way, Ceph has improved its docs a lot, but they could still use some work.

-John

On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <dieter@xxxxxxxxx> wrote:
> On Tue, 18 Sep 2012 01:26:03 +0200
> John Axel Eriksson <john@xxxxxxxxx> wrote:
>
>> another distributed
>> storage solution that had failed us more than once and we lost data.
>> Since the old system had an http interface (not S3 compatible though)
>
> can you say a bit more about this? failure stories are very interesting and useful.
>
> Dieter