Re: How are you using Ceph?

Right, it just takes time to grow these things.
Maybe the process could be accelerated by being more out there, but what do I know about marketing... not much :)

Dieter

On Tue, 18 Sep 2012 10:27:52 -0500
Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

> Hi Dieter,
> 
> It sounds like some of those things will come with time (more 
> experienced community, docs, deployments, papers, etc).  Are there other 
> things we could be doing that would make Ceph feel less risky for people 
> doing similar comparisons?
> 
> Thanks,
> Mark
> 
> On 09/18/2012 10:19 AM, Plaetinck, Dieter wrote:
> > I don't mind.
> > Ultimately it came down to ceph vs swift for us.
> > Nothing is cast in stone yet, but we chose swift for our new, not-yet-production cluster, because
> > swift has been around longer and has more production deployments, and hence a bigger, more experienced community, better documentation (both official and unofficial: blogs, tutorials, etc.) and more conferences/tech talks.
> > It's also a simpler system that reuses more existing technology, which makes it (a bit?) less efficient but easier to understand (HTTP vs. a custom protocol, cluster metadata in sqlite, Python, which I'm more comfortable with than C, etc.).
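> >
> > To illustrate what I mean by reusing existing technology: talking to swift is
> > just plain HTTP, roughly like this (a sketch only - the endpoint, token and
> > container names are made up):
> >
> > import requests
> >
> > BASE = "http://swift.example.com:8080/v1/AUTH_myaccount"  # made-up endpoint
> > HEADERS = {"X-Auth-Token": "AUTH_tk_made_up_token"}       # made-up token
> >
> > # Upload an object: a single HTTP PUT to /<container>/<object>
> > with open("db-backup.tar.gz", "rb") as f:
> >     requests.put(BASE + "/backups/db-backup.tar.gz",
> >                  headers=HEADERS, data=f).raise_for_status()
> >
> > # Read it back with a plain GET - easy to debug with curl or any HTTP tool
> > resp = requests.get(BASE + "/backups/db-backup.tar.gz", headers=HEADERS)
> > resp.raise_for_status()
> >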
> > I would like to implement Ceph (because on paper it's just awesome) but running it involves a certain uncertainty/risk I personally don't want to take yet.
> >
> > Dieter
> >
> > On Tue, 18 Sep 2012 09:56:50 -0500
> > Mark Nelson<mark.nelson@xxxxxxxxxxx>  wrote:
> >
> >> Agreed, this was a really interesting writeup!  Thanks John!
> >>
> >> Dieter, do you mind if I ask what was compelling for you in choosing
> >> swift vs. the other options you've looked at, including Ceph?
> >>
> >> Thanks,
> >> Mark
> >>
> >> On 09/18/2012 09:51 AM, Plaetinck, Dieter wrote:
> >>> Thanks a lot for the detailed writeup, I found it quite useful.
> >>> The list of contestants is similar to the list I made when researching (and I also had Luwak);
> >>> while I also think Ceph is very promising and probably deserves to dominate in the future,
> >>> I'm focusing on OpenStack Swift for now. FWIW
> >>>
> >>> Dieter
> >>>
> >>> On Tue, 18 Sep 2012 16:34:23 +0200
> >>> John Axel Eriksson<john@xxxxxxxxx>   wrote:
> >>>
> >>>> I actually opted not to specifically mention the product we had
> >>>> problems with, since there have been lots of changes and fixes to it
> >>>> which we unfortunately were unable to make use of (you'll know why
> >>>> later). But I guess it's interesting enough to go into a little more
> >>>> detail, so... before moving to Ceph we were using the Riak distributed
> >>>> database from Basho - http://riak.basho.com.
> >>>>
> >>>> First I have to say that Riak is actually pretty awesome in many ways
> >>>> - not least operations-wise. Compared to Ceph it's a lot easier
> >>>> to get up and running and to add storage as you go... basically just one
> >>>> command to add a node to the cluster, and you only need the address of
> >>>> any other existing node for this. With Riak, every node is the same,
> >>>> so there is no SPOF by default (e.g. no MDS, no MON - just nodes).
> >>>>
> >>>> As you might have thought already, "distributed database" isn't exactly
> >>>> the same as "distributed storage", so why did we use it? Well, there is
> >>>> an add-on to Riak called Luwak, also created and supported by Basho,
> >>>> that is touted as "Large Object Support", where you can store objects
> >>>> as large as you want. I think our main problem was with using this
> >>>> add-on (as I said, created and supported by Basho). An object in
> >>>> "standard" Riak k/v is limited to... I think around 40 MB, or at least
> >>>> you shouldn't store larger objects than that because it means
> >>>> "trouble". Anyway, we went with Luwak, which seemed to be a perfect
> >>>> solution for the type of storage we do.
> >>>>
> >>>> We ran with Luwak for almost two years and usually it served us pretty
> >>>> well. Unfortunately there were bugs and hidden problems which IMO
> >>>> Basho should have been more open about. One issue is that Riak is
> >>>> based on a repair mechanism called "read-repair" - the name pretty much
> >>>> tells you how it works: data will only be repaired on a read. Now that
> >>>> is a problem in itself when you archive data, which we do (i.e. not
> >>>> reading it very often, or at all).
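> >>>>
> >>>> Roughly, the idea is something like this (a toy model I made up just to
> >>>> illustrate the point - it has nothing to do with Riak's real internals):
> >>>>
> >>>> # Three replicas of the same key; one replica has lost its copy.
> >>>> replicas = {
> >>>>     "node-a": {"photo-123": b"...data..."},
> >>>>     "node-b": {},  # broken replica
> >>>>     "node-c": {"photo-123": b"...data..."},
> >>>> }
> >>>>
> >>>> def read(key):
> >>>>     # Ask every replica for its copy of the key.
> >>>>     copies = {node: store.get(key) for node, store in replicas.items()}
> >>>>     good = next(v for v in copies.values() if v is not None)
> >>>>     # Read-repair: push the good copy back to replicas that lost it.
> >>>>     for node, value in copies.items():
> >>>>         if value is None:
> >>>>             replicas[node][key] = good
> >>>>     return good
> >>>>
> >>>> # The repair only happens inside read(): if nobody ever reads
> >>>> # "photo-123" (archived data!), node-b stays broken indefinitely.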
> >>>>
> >>>> With Luwak (the large-object add-on), data is split into many keys and
> >>>> values and stored in the "normal" Riak k/v store... unfortunately
> >>>> read-repair in this scenario doesn't seem to work at all, and if
> >>>> something was missing Riak had a tendency to crash HARD, sometimes
> >>>> managing to take the whole machine with it. There were also strange
> >>>> issues where one crashing node seemed to affect its neighbors so that
> >>>> they also crashed... a domino effect which makes "distributed" a
> >>>> little too "distributed". This didn't always happen, but it did happen
> >>>> several times in our case. The logs were often pretty hard to
> >>>> understand and more often than not left us completely in the dark
> >>>> about what was going on.
> >>>>
> >>>> We also discovered that deleting data in Luwak doesn't actually DO
> >>>> anything... sure, the key is gone, but the data is still on disk -
> >>>> seemingly orphaned - so deleting was more or less a noop. This was
> >>>> nowhere to be found in the docs.
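> >>>>
> >>>> To make it concrete, conceptually the layout is something like this (a
> >>>> made-up key scheme, not Basho's actual on-disk format):
> >>>>
> >>>> CHUNK_SIZE = 1024 * 1024  # split large objects into 1 MB chunks
> >>>>
> >>>> def put_large(kv, name, data):
> >>>>     # Store each chunk under its own key in the plain k/v store...
> >>>>     chunk_keys = []
> >>>>     for i in range(0, len(data), CHUNK_SIZE):
> >>>>         key = "%s:chunk:%d" % (name, i // CHUNK_SIZE)
> >>>>         kv[key] = data[i:i + CHUNK_SIZE]
> >>>>         chunk_keys.append(key)
> >>>>     # ...plus a manifest key that lists them.
> >>>>     kv["%s:manifest" % name] = chunk_keys
> >>>>
> >>>> def delete(kv, name):
> >>>>     # Only the manifest goes away; every chunk key is left behind,
> >>>>     # orphaned on disk - which is what made deleting a noop for us.
> >>>>     del kv["%s:manifest" % name]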
> >>>>
> >>>> Finally, I think on the 3rd of June this year, we requested paid support from
> >>>> Basho to help us in our last crash-and-burn situation, and that's when
> >>>> we, among other things, were told that DELETEing only
> >>>> seems to work. We were also told that Luwak was originally created to
> >>>> store email and not really the types of things we store (i.e. files).
> >>>> This information wasn't available anywhere - Luwak simply had the
> >>>> wrong "table of contents" associated with it. All this was quite a
> >>>> turn-off for us. To Basho's credit, they really did help us fix our
> >>>> cluster and they're really nice, friendly and helpful guys.
> >>>>
> >>>> Actually, I think the last straw was when Luwak was suddenly - out of
> >>>> nowhere, really - discontinued around the beginning of this year,
> >>>> probably because of the bugs and hidden problems that I think may have
> >>>> come from a less-than-stellar implementation of large-object support
> >>>> from the start... so by then we were on something completely
> >>>> unsupported. We couldn't switch immediately, of
> >>>> course, but we did start looking around for alternatives at that time.
> >>>> That's when I found Ceph, among other more or less distributed systems;
> >>>> the others were:
> >>>>
> >>>> Tahoe-LAFS       https://tahoe-lafs.org/trac/tahoe-lafs
> >>>> XtreemFS         http://www.xtreemfs.org
> >>>> HDFS             http://hadoop.apache.org/hdfs/
> >>>> GlusterFS        http://www.gluster.org
> >>>> PomegranateFS    https://github.com/macan/Pomegranate/wiki
> >>>> moosefs          http://www.moosefs.org
> >>>> Openstack Swift  http://docs.openstack.org/developer/swift/
> >>>> MongoDB GridFS   http://www.mongodb.org/display/DOCS/GridFS
> >>>> LS4              http://ls4.sourceforge.net/
> >>>>
> >>>> After trying most of these I decided to look closer at a few of them:
> >>>> MooseFS, HDFS, XtreemFS and Ceph - the others were either not really
> >>>> suited for our use case or just too complicated to set up and keep
> >>>> running (IMO). For a short while I dabbled in writing my own storage
> >>>> system using zeromq for communication, but it's just not what our
> >>>> company does - so I gave that up pretty quickly :-). In the end I
> >>>> chose Ceph. Ceph wasn't as easy as Riak/Luwak operationally, but it was
> >>>> better in every other aspect and definitely a good fit. The RADOS
> >>>> Gateway (S3-compatible) was really a big thing for us as well.
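> >>>>
> >>>> The nice part is that existing S3 tooling works against the gateway more
> >>>> or less unchanged; with boto it's roughly like this (hostname and keys
> >>>> below are made up):
> >>>>
> >>>> import boto
> >>>> import boto.s3.connection
> >>>>
> >>>> conn = boto.connect_s3(
> >>>>     aws_access_key_id="MY_RGW_ACCESS_KEY",      # made up
> >>>>     aws_secret_access_key="MY_RGW_SECRET_KEY",  # made up
> >>>>     host="rgw.example.com",                     # the gateway, not Amazon
> >>>>     is_secure=False,
> >>>>     calling_format=boto.s3.connection.OrdinaryCallingFormat(),
> >>>> )
> >>>> bucket = conn.create_bucket("archive")
> >>>> key = bucket.new_key("photo-123.jpg")
> >>>> key.set_contents_from_filename("photo-123.jpg")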
> >>>>
> >>>> As I started out saying: there have been many improvements to Riak, not
> >>>> least to the large-object support... but that large-object
> >>>> support is not built on Luwak but is a completely new thing, and it's not
> >>>>> open source or free. It's called Riak CS (CS for Cloud Storage, I
> >>>> guess), has an S3-compatible interface, and seems to be pretty
> >>>> good. We had many internal discussions about whether Riak CS was the right move
> >>>> for us, but in the end we decided on Ceph since we couldn't justify the
> >>>> cost of Riak CS.
> >>>>
> >>>> To sum it up: we made, in retrospect, a bad choice - not because Riak
> >>>> itself doesn't work or isn't any good for the things it's good at (it
> >>>> really is!), but because the add-on Luwak was misrepresented and not a
> >>>> good fit for us.
> >>>>
> >>>> I really have high hopes for Ceph and I think it has a bright future
> >>>> in our company and in general. Riak CS would probably have been a very
> >>>> good fit as well if it wasn't for the cost involved.
> >>>>
> >>>> So there you have it - not just failure scenarios but bad decisions,
> >>>> misrepresentation of features and somewhat sparse documentation. By the
> >>>> way, Ceph has improved its docs a lot but they could still use some
> >>>> work.
> >>>>
> >>>> -John
> >>>>
> >>>>
> >>>> On Tue, Sep 18, 2012 at 9:47 AM, Plaetinck, Dieter<dieter@xxxxxxxxx>   wrote:
> >>>>> On Tue, 18 Sep 2012 01:26:03 +0200
> >>>>> John Axel Eriksson<john@xxxxxxxxx>   wrote:
> >>>>>
> >>>>>> another distributed
> >>>>>> storage solution that had failed us more than once and we lost data.
> >>>>>> Since the old system had an http interface (not S3 compatible though)
> >>>>>
> >>>>> Can you say a bit more about this? Failure stories are very interesting and useful.
> >>>>>
> >>>>> Dieter
> >>>
> >>
> >
> 


