I actually opted not to mention the specific product we had problems with, since there have been lots of changes and fixes to it which we unfortunately were unable to make use of (you'll see why later). But I guess it's interesting enough to go into a little more detail, so: before moving to Ceph we were using the Riak distributed database from Basho - http://riak.basho.com.

First I have to say that Riak is actually pretty awesome in many ways, not least operations-wise. Compared to Ceph it's a lot easier to get up and running and to add storage as you go... basically just one command to add a node to the cluster, and all you need is the address of any existing node. With Riak every node is the same, so there is no SPOF by default (no MDS, no MON - just nodes).

As you might have thought already, "distributed database" isn't exactly the same as "distributed storage", so why did we use it? Well, there is an add-on to Riak called Luwak, also created and supported by Basho, that was touted as "large object support": you could store objects as large as you wanted. Our main problems were with this add-on (which, as I said, was created and supported by Basho). An object in "standard" Riak K/V is limited to, I think, around 40 MB - or at least you shouldn't store larger objects than that because it means trouble. So we went with Luwak, which seemed to be a perfect solution for the kind of storage we do.

We ran with Luwak for almost two years and it usually served us pretty well. Unfortunately there were bugs and hidden problems which, IMO, Basho should have been more open about. One issue is that Riak relies on a repair mechanism called "read-repair" - the name pretty much tells you how it works: data is only repaired when it is read. That is a problem in itself when you archive data, which we do (i.e. data that is read rarely or not at all). With Luwak (the large-object add-on), data is split into many keys and values and stored in the "normal" Riak K/V store - I'll sketch roughly what I mean below. Unfortunately read-repair didn't seem to work at all in that scenario, and if something was missing Riak had a tendency to crash HARD, sometimes managing to take the whole machine down with it. There were also strange issues where one crashing node seemed to affect its neighbors so that they crashed too... a domino effect which makes "distributed" a little too distributed. This didn't always happen, but it did happen several times in our case. The logs were often pretty hard to understand and more often than not left us completely in the dark about what was going on.

We also discovered that deleting data in Luwak doesn't actually DO anything... sure, the key is gone, but the data is still on disk, seemingly orphaned, so deleting was more or less a no-op. This was nowhere to be found in the docs. Finally, on the 3rd of June this year I think, we requested paid support from Basho to help us in our last crash-and-burn situation, and that's when we were told, among other things, that DELETE only appears to work. We were also told that Luwak was originally created to store email and not really the kinds of things we store (i.e. files). This information wasn't available anywhere - Luwak simply had the wrong "table of contents" associated with it. All this was quite a turn-off for us. To Basho's credit they really did help us fix our cluster, and they're really nice, friendly and helpful guys.
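To make the read-repair problem a bit more concrete, here is roughly how I understand a Luwak-style layer to work. This is only my own sketch of the idea in Python, not Basho's actual implementation, and every name in it is made up: a large object is cut into fixed-size blocks, each block is written to the ordinary K/V store under its own key, and a small manifest lists the block keys.

  # Illustrative sketch only - a dict stands in for the distributed K/V store.
  import hashlib

  CHUNK_SIZE = 1024 * 1024   # 1 MB blocks; the size is just an example

  kv_store = {}              # pretend this is the Riak K/V store

  def put_large_object(name, data):
      """Split 'data' into blocks, store each block under its own key,
      then store a manifest listing the block keys."""
      block_keys = []
      for i in range(0, len(data), CHUNK_SIZE):
          block = data[i:i + CHUNK_SIZE]
          key = "block:" + hashlib.sha1(block).hexdigest()
          kv_store[key] = block            # one ordinary K/V write per block
          block_keys.append(key)
      kv_store["manifest:" + name] = block_keys

  def get_large_object(name):
      """Reassemble the object by reading every block.
      In a real cluster these reads are what would trigger read-repair,
      so blocks that are never read never get repaired."""
      return b"".join(kv_store[k] for k in kv_store["manifest:" + name])

The point is simply that one "file" ends up as a lot of separate keys, and each of those keys only gets a chance at read-repair when somebody actually reads it - which, for archived data, is more or less never.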
Actually, I think the last straw was when Luwak was suddenly - out of nowhere, really - discontinued around the beginning of this year, probably because of those bugs and hidden problems, which I suspect go back to a less than stellar implementation of large-object support from the start. By then we were running on something completely unsupported. We couldn't switch to something else immediately, of course, but that's when we started looking around.

That's when I found Ceph, among other more or less distributed systems. The others were:

  Tahoe-LAFS: https://tahoe-lafs.org/trac/tahoe-lafs
  XtreemFS: http://www.xtreemfs.org
  HDFS: http://hadoop.apache.org/hdfs/
  GlusterFS: http://www.gluster.org
  PomegranateFS: https://github.com/macan/Pomegranate/wiki
  MooseFS: http://www.moosefs.org
  OpenStack Swift: http://docs.openstack.org/developer/swift/
  MongoDB GridFS: http://www.mongodb.org/display/DOCS/GridFS
  LS4: http://ls4.sourceforge.net/

After trying most of these I decided to look closer at a few of them: MooseFS, HDFS, XtreemFS and Ceph. The others were either not really suited to our use case or just too complicated to set up and keep running (IMO). For a short while I dabbled in writing my own storage system using ZeroMQ for communication, but that's just not what our company does, so I gave that up pretty quickly :-). In the end I chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally, but it was better in every other aspect and definitely a good fit. The RADOS Gateway (S3-compatible) was really a big thing for us as well.

As I started out saying, there have been many improvements to Riak, not least to the large-object support... but that large-object support is no longer built on Luwak. It's a completely new thing, and it's not open source or free. It's called Riak CS (Cloud Storage), it has an S3-compatible interface, and it seems to be pretty good. We had many internal discussions about whether Riak CS was the right move for us, but in the end we decided on Ceph since we couldn't justify the cost of Riak CS.

To sum it up: in retrospect we made a bad choice - not because Riak itself doesn't work or isn't good at the things it's good at (it really is!) but because the Luwak add-on was misrepresented and wasn't a good fit for us. I really have high hopes for Ceph and I think it has a bright future, both in our company and in general. Riak CS would probably have been a very good fit as well if it weren't for the cost involved.

So there you have it - not just failure scenarios, but bad decisions, misrepresentation of features and somewhat sparse documentation. By the way, Ceph has improved its docs a lot, but they could still use some work.

-John

On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter <dieter@xxxxxxxxx> wrote:
> On Tue, 18 Sep 2012 01:26:03 +0200
> John Axel Eriksson <john@xxxxxxxxx> wrote:
>
>> another distributed
>> storage solution that had failed us more than once and we lost data.
>> Since the old system had an http interface (not S3 compatible though)
>
> can you say a bit more about this? failure stories are very interesting and useful.
>
> Dieter