On 09/16/2009 05:45 AM, Gordan Bobic wrote:
It's not my project (I'm just a user of it), but having done my research, my conclusion is that there is nothing else available that is similar to GlusterFS. The world has waited a long time for this, and imperfect as it may be, I don't see anything else similar on the horizon. GlusterFS is an implementation of something that has only been academically discussed elsewhere. And I haven't seen any evidence of any other similar things being implemented any time soon. But if you think you can do better, go for it. :-)
I came to a slightly different conclusion, but with similar effect. Of the projects available, GlusterFS is the closest to production *today*. The world has waited a long time for this. It is imperfect, but right now it's still high on the list of solutions that can be used today and have potential for tomorrow.
In case it is of any use to others, here is the list I worked out earlier when doing my analysis:
- GlusterFS (http://gluster.com/community/index.php) - Very promising shared nothing architecture, production ready software supported commercially, based on FUSE (provides insulation from the kernel at a small performance cost). Simple configuration. Very cute implementation where each "brick" for a "cluster/replication" setup is just a regular file system that can be accessed natively, so the data is always safe and can be inspected using UNIX commands or backed up using rsync. Most logic is client side, including replication, and they use file system attributes to journal changes and "self-heal". Very recently, however, there have been some problems, possibly with how GlusterFS calls Linux, triggering a Linux problem that causes the system to freeze up a bit. My own first test froze things up. The GlusterFS support people want to find the problem and I will be working with them to see whether this can be resolved or not.
- Ceph (http://ceph.newdream.net/) - Very promising shared nothing architecture, that has kernel module support instead of FUSE (better performance) but not ready for production. They say they will stabilize it by the end of 2009, but do not recommend using it for production even at that time.
- PVFS (http://www.pvfs.org/) - Very promising architecture. Widely used in production. V1 has a shared metadata server. In V2 they are changing to a shared nothing architecture. Has kernel module support instead of FUSE (better performance). However, PVFS does not provide POSIX guarantees. In particular, they do not implement advisory locking through flock()/fcntl(). This means that use of this system would probably require an architecture that does master/slave fail over as opposed to master/master fail over. Most file system accesses do not need this level of locking, but dovecot in particular probably does. The dovecot locking through .lock files might work, but I need to look a little closer.
- Grid Datafarm (http://datafarm.apgrid.org/) - Designed as a user space data sharing mechanism, however a FUSE module is available to provide POSIX functionality on top.
- Lustre (http://www.lustre.org/) - Seems to be the focus of the commercial world. Currently based on ext3/ext4, to be based on ZFS in 2010. The weakness seems to be the single shared metadata server, which must be made highly available using a shared disk solution such as GFS or OCFS. Due to this architecture, I do not consider this solution to meet our requirement of a shared nothing architecture where any server can completely die and the other server takes over the load without intervention.
- MooseFS (http://www.moosefs.com/) - Alternative to Lustre. Still uses a shared metadata server, and therefore does not meet requirements.
- XtreemFS (http://en.wikipedia.org/wiki/XtreemFS) - Very promising architecture. However, current version uses single metadata server and will only replicate content that is specifically marked as read only. Replicated metadata scheduled for 2010Q1. Read/write replication scheduled for some time later.
- CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based. Btrfs is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre, although development of this solution seems slow and this system is not ready for production. Development of both has effectively stalled since 2008. If these are ever released, I think they will be great solutions, but they are apparently having design problems (either developers who are not good enough, or the design is too complicated, probably both).
- TahoeFS (http://allmydata.org/trac/tahoe) - POSIX interface (via FUSE) not ready for production.
- Coda (http://www.coda.cs.cmu.edu/) and InterMezzo (http://en.wikipedia.org/wiki/InterMezzo_%28file_system%29) - Older "experimental" distributed file systems still being maintained, but no development beyond bugfixes that I can see. They say the developers have moved on to Lustre.
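Since the locking question in the PVFS entry is easy to check empirically, here is a rough probe one could run against any mounted candidate. It is only a sketch under my own assumptions: the function names and the "probe.lock" file name are made up for illustration, and it checks behaviour from a single host only, so it says nothing about whether locks are honoured between two cluster nodes (for that, the two halves would have to run on different machines against the same mount).

```python
import errno
import fcntl
import os
import sys
import tempfile


def fcntl_lock_excludes_other_process(directory):
    """Return True if an fcntl lock held here is refused to a second process."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)  # parent takes an exclusive lock
        except OSError:
            return False  # the file system rejected the lock request outright
        pid = os.fork()
        if pid == 0:
            # Child: fcntl locks are per process, so this attempt must conflict.
            status = 1
            try:
                child_fd = os.open(path, os.O_RDWR)
                fcntl.lockf(child_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno in (errno.EAGAIN, errno.EACCES):
                    status = 0  # refused because the parent holds the lock
            finally:
                os._exit(status)  # never fall through into the parent's code
        _, wait_status = os.waitpid(pid, 0)
        return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
    finally:
        os.close(fd)
        os.unlink(path)


def dotlock_is_exclusive(directory):
    """Return True if creating an existing .lock file with O_EXCL fails."""
    lock_path = os.path.join(directory, "probe.lock")  # hypothetical name
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    fd = os.open(lock_path, flags, 0o600)
    try:
        try:
            os.close(os.open(lock_path, flags, 0o600))
        except FileExistsError:
            return True  # second exclusive create was refused, as it should be
        return False
    finally:
        os.close(fd)
        os.unlink(lock_path)


if __name__ == "__main__":
    # Pass the mount point to test; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print("fcntl lock excludes second process:",
          fcntl_lock_excludes_other_process(target))
    print(".lock file creation is exclusive:", dotlock_is_exclusive(target))
```

Run it with the mount point as the only argument. A False from the first check would suggest that anything relying on fcntl() locking needs another mechanism on that file system; the second check is the same idea as .lock files, which depend on exclusive create rather than on lock calls.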
I am still having some problems with GlusterFS - I rebooted my machines at the exact same time and all three came up frozen in the mount call. Now that I know how to clear the problem - ssh in with another window, and kill -9 the mount, it isn't so bad - but I can't take this to production unless this issue is resolved. I'll try to come up with better details.
Cheers,
mark

-- 
Mark Mielke <mark@xxxxxxxxx>