Thanks a lot for the detailed writeup, I found it quite useful. The list of contestants is similar to the list I made when researching (and I also had Luwak); while I also think Ceph is very promising and probably deserves to dominate in the future, I'm focusing on OpenStack Swift for now.

FWIW
Dieter

On Tue, 18 Sep 2012 16:34:23 +0200, John Axel Eriksson <john@xxxxxxxxx> wrote:

> I actually opted not to specifically mention the product we had problems with, since there have been lots of changes and fixes to it which we unfortunately were unable to make use of (you'll see why later). But I guess it's interesting enough to go into a little more detail, so... before moving to Ceph we were using the Riak distributed database from Basho - http://riak.basho.com.
>
> First I have to say that Riak is actually pretty awesome in many ways, not least operations-wise. Compared to Ceph it's a lot easier to get up and running and to add storage as you go... basically just one command to add a node to the cluster, and you only need the address of any existing node for this. With Riak, every node is the same, so there is no SPOF by default (e.g. no MDS, no MON - just nodes).
>
> As you might have thought already, "distributed database" isn't exactly the same as "distributed storage", so why did we use it? Well, there is an add-on to Riak called Luwak, also created and supported by Basho, that is touted as "Large Object Support", letting you store objects as large as you want. I think our main problem was with using this add-on (as I said, created and supported by Basho). An object in "standard" Riak k/v is limited to... I think around 40 MB, or at least you shouldn't store larger objects than that because it means "trouble". Anyway, we went with Luwak, which seemed to be a perfect solution for the type of storage we do.
>
> We ran with Luwak for almost two years and usually it served us pretty well. Unfortunately there were bugs and hidden problems which, IMO, Basho should have been more open about. One issue is that Riak is based on a repair mechanism called "read-repair" - the name pretty much tells you how it works: data is only repaired when it is read. That is a problem in itself when you archive data, which we do (i.e. data that is read rarely or not at all).
>
> With Luwak (the large-object add-on), data is split into many keys and values and stored in the "normal" Riak k/v store... unfortunately read-repair in this scenario doesn't seem to work at all, and if something was missing, Riak had a tendency to crash HARD, sometimes managing to take the whole machine with it (a toy sketch of this chunking-plus-read-repair combination follows a few paragraphs down). There were also strange issues where one crashing node seemed to affect its neighbors so that they also crashed... a domino effect that makes "distributed" a little too "distributed". This didn't always happen, but it did happen several times in our case. The logs were often pretty hard to understand and more often than not left us completely in the dark about what was going on.
>
> We also discovered that deleting data in Luwak doesn't actually DO anything... sure, the key is gone, but the data is still on disk - seemingly orphaned - so deleting was more or less a no-op. This was nowhere to be found in the docs.
>
> Finally, on I think the 3rd of June this year, we requested paid support from Basho to help us in our last crash-and-burn situation, and that's when we, among other things, were told that DELETEing only *seems* to work.
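As a concrete illustration of the two mechanisms described above - a large object chunked into ordinary k/v entries, and read-repair that only touches keys that are actually read - here is a minimal, self-contained Python sketch. It is not Riak or Luwak code: the Replica class, the key naming and the manifest layout are invented for illustration, and a plain dict stands in for each node's backend.

# Toy illustration only - not Riak/Luwak code. The class and function names,
# the "<name>/segment/<n>" key scheme and the manifest layout are invented
# for this sketch; a plain dict stands in for each node's storage backend.

CHUNK_SIZE = 1024 * 1024  # hypothetical 1 MB segments


class Replica(object):
    """One node's local key/value store."""
    def __init__(self):
        self.data = {}


def put_large_object(replicas, name, blob, chunk_size=CHUNK_SIZE):
    """Split blob into segment keys, write every segment to every replica,
    and store a small manifest listing the segment keys (Luwak worked along
    roughly these lines: a large object becomes many ordinary k/v entries)."""
    segment_keys = []
    for i in range(0, len(blob), chunk_size):
        key = "%s/segment/%d" % (name, i // chunk_size)
        segment_keys.append(key)
        for r in replicas:
            r.data[key] = blob[i:i + chunk_size]
    manifest_key = "%s/manifest" % name
    for r in replicas:
        r.data[manifest_key] = segment_keys
    return manifest_key


def read_with_read_repair(replicas, key):
    """Read-repair in miniature: compare the replicas' copies of *this key
    only* and copy a surviving value onto any replica that lost it. Keys that
    are never read are never compared, so a lost segment of an archived
    object stays lost until someone happens to fetch it."""
    values = [r.data.get(key) for r in replicas]
    good = next((v for v in values if v is not None), None)
    if good is not None:
        for r in replicas:
            r.data.setdefault(key, good)  # repair happens only on this read
    return good


if __name__ == "__main__":
    nodes = [Replica(), Replica(), Replica()]
    put_large_object(nodes, "archive-2012", b"x" * (3 * CHUNK_SIZE))

    # One replica quietly loses a segment of an object nobody reads anymore...
    del nodes[1].data["archive-2012/segment/1"]

    # ...and it stays lost until that exact key is read again:
    read_with_read_repair(nodes, "archive-2012/segment/1")
    assert "archive-2012/segment/1" in nodes[1].data

The point of the toy example is the last few lines: a segment of an "archived" object can silently disappear from one replica, and nothing notices or repairs it until that exact key is read again - which, for archival workloads, may be never.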
> We were also told that Luwak was originally created to store email, not really the types of things we store (i.e. files). This information wasn't available anywhere - Luwak simply had the wrong "table of contents" associated with it. All this was quite a turn-off for us. To Basho's credit, they really did help us fix our cluster, and they're really nice, friendly and helpful guys.
>
> Actually, I think the last straw was when Luwak was suddenly - out of nowhere, really - discontinued around the beginning of this year, probably because of the bugs and hidden problems that I think may have come from a less than stellar implementation of large-object support from the start... so by then we were running on something completely unsupported. We couldn't switch to something else immediately, of course, but we started looking around at that point. That's when I found Ceph, among other more or less distributed systems. The others were:
>
> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
> XtreemFS         http://www.xtreemfs.org
> HDFS             http://hadoop.apache.org/hdfs/
> GlusterFS        http://www.gluster.org
> PomegranateFS    https://github.com/macan/Pomegranate/wiki
> MooseFS          http://www.moosefs.org
> OpenStack Swift  http://docs.openstack.org/developer/swift/
> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
> LS4              http://ls4.sourceforge.net/
>
> After trying most of these I decided to look closer at a few of them - MooseFS, HDFS, XtreemFS and Ceph - since the others were either not really suited for our use case or just too complicated to set up and keep running (IMO). For a short while I dabbled in writing my own storage system using ZeroMQ for communication, but that's just not what our company does, so I gave it up pretty quickly :-). In the end I chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally, but it was better in every other respect and definitely a good fit. The Rados Gateway (S3-compatible) was really a big thing for us as well (see the short client sketch at the end of this message).
>
> As I started out saying, there have been many improvements to Riak, not least to the large-object support... but that large-object support is not built on Luwak; it's a completely new thing, and it's not open source or free. It's called Riak CS (CS for Cluster Storage, I guess), has an S3-compatible interface, and seems to be pretty good. We had many internal discussions about whether Riak CS was the right move for us, but in the end we decided on Ceph since we couldn't justify the cost of Riak CS.
>
> To sum it up: we made, in retrospect, a bad choice - not because Riak itself doesn't work or isn't good at the things it's good at (it really is!) but because the add-on Luwak was misrepresented and not a good fit for us.
>
> I really have high hopes for Ceph and I think it has a bright future in our company and in general. Riak CS would probably have been a very good fit as well if it weren't for the cost involved.
>
> So there you have it - not just failure scenarios but bad decisions, misrepresentation of features and somewhat sparse documentation. By the way, Ceph has improved its docs a lot, but they could still use some work.
>
> -John
>
>
> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <dieter@xxxxxxxxx> wrote:
> > On Tue, 18 Sep 2012 01:26:03 +0200
> > John Axel Eriksson <john@xxxxxxxxx> wrote:
> >
> >> another distributed
> >> storage solution that had failed us more than once and we lost data.
> >> Since the old system had an http interface (not S3 compatible though)
> >
> > can you say a bit more about this? failure stories are very interesting and useful.
> >
> > Dieter
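Since John mentions the Rados Gateway's S3 compatibility, here is a minimal sketch of what that buys in practice: an off-the-shelf S3 client (boto here) can talk to the gateway simply by being pointed at a different host. The hostname, credentials, and bucket/object names below are placeholders, not anything from a real setup.

# Minimal sketch: using a stock S3 client (boto) against a RADOS Gateway
# endpoint. Hostname, credentials and bucket/object names are placeholders.
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',      # placeholder: keys for a radosgw user
    aws_secret_access_key='SECRET_KEY',  # placeholder
    host='rgw.example.com',              # the gateway, not s3.amazonaws.com
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('archive')
key = bucket.new_key('backup-2012-06-03.tar')
key.set_contents_from_string('example payload')  # placeholder data
print(key.generate_url(3600))  # signed URL, valid for an hour

This is largely why an S3-compatible interface matters for a migration like the one described above: existing S3 tooling and client code can be reused against the gateway mostly unchanged.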