Some comments as a user of the open source version, and as a reseller of
the commercial version, including having provided emergency support to
users ... take these with a grain of salt if you wish.
Mark Mielke wrote:
> On 09/16/2009 05:45 AM, Gordan Bobic wrote:
>> It's not my project (I'm just a user of it), but having done my
>> [...]
> I came to a slightly different conclusion, but similar effect. Of the
> projects available, GlusterFS is the closest to production *today*. The
> world has waited a long time for this. It is imperfect, but right now
> it's still high on the list of solutions that can be used today and have
> potential for tomorrow.
As a user of many file systems over (quite) a span of time, I have yet
to see "the one true file system that is really bug free, always
works, and never fails." All software is buggy. Some more so than
others, but all software is buggy. Anyone telling you otherwise is
trying to sell you something.
For every storage design and implementation you do, you need to ask
yourself "if this went away, what would be the impact upon me and my
work?" You then need to design to this. Failure to do so ... well ...
> In case it is of any use to others, here is the list I had worked out
> before when doing my analysis:
> - GlusterFS (http://gluster.com/community/index.php) - Very
> promising shared nothing architecture, production ready software
> supported commercially, based on FUSE (provides insulation from the
> kernel at a small performance cost). Simple configuration. Very cute
> implementation where each "brick" for a "cluster/replication" setup is
> just a regular file system that can be accessed natively, so the data is
> always safe and can be inspected using UNIX commands or backed up using
> rsync. Most logic is client side, including replication, and they use
> file system attributes to journal changes and "self-heal". But, very
> recently there have been some problems, possibly with how GlusterFS
> calls Linux, triggering a Linux problem that causes the system to
> freeze up a bit. My own first test froze things up. The GlusterFS
> support people want to find the problem and I will be working with them
> to see whether this can be resolved or not.
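Since each brick is a plain file system, you can look at the self-heal
bookkeeping yourself: GlusterFS keeps its replication state in extended
attributes under the trusted.* namespace, which only root can read. A
minimal Python 3 sketch (the brick path here is invented, and the exact
attribute names depend on your volume definition):

    import os

    # A file on the raw brick, NOT through the GlusterFS mount.
    # This path is made up for illustration.
    brick_file = "/data/brick1/mail/user/inbox"

    # Replication/self-heal state lives in "trusted.*" extended
    # attributes; run this as root on the brick server.
    for name in os.listxattr(brick_file):
        if name.startswith("trusted."):
            print(name, os.getxattr(brick_file, name).hex())

This is also why backing up a brick with rsync works: the data and the
bookkeeping are just ordinary files plus extended attributes (rsync
needs -X/--xattrs to carry the latter).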
> - Ceph (http://ceph.newdream.net/) - Very promising shared nothing
> architecture that has kernel module support instead of FUSE (better
> performance) but is not ready for production. They say they will
> stabilize it by the end of 2009, but do not recommend using it for
> production even at that time.
Ceph is very interesting, and should be one to watch over time. Sage
and group seem to have fewer resources at their disposal than
Z-Research, so evolution may take longer.
> - PVFS (http://www.pvfs.org/) - Very promising architecture. Widely
> used in production. V1 has a shared metadata server; with V2 they are
> changing to a shared nothing architecture. Has kernel module support
> instead of FUSE (better performance). However, PVFS does not provide
> POSIX guarantees. In particular, they do not implement advisory locking
> through flock()/fcntl(). This means that use of this system would
> probably require an architecture that does master/slave fail over as
> opposed to master/master fail over. Most file system accesses do not
> care for this level of locking, but Dovecot in particular probably does.
> The Dovecot locking through .lock files might work, but I need to look
> a little closer.
PVFS is not a POSIX file system. You shouldn't try to use it as one.
PVFS2 is the current release, and as Dan from Synthetic Genomics might
note, it has some issues with codes that want to use it as a parallel
POSIX file system. PVFS2 is purpose-built for MPI-IO and related codes.
There is nothing wrong with this, and in fact, this is a good thing,
as MPI-IO capabilities are very important in HPC sectors.
Probably not so important for Dovecot.
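If you want to check what a given mount actually supports, it is easy
enough to probe. A rough Python sketch follows; the paths are invented,
the fcntl lock is the POSIX advisory kind that PVFS skips, and the
dotlock functions show the O_CREAT|O_EXCL scheme that .lock file
approaches like Dovecot's rely on, which needs nothing special from the
underlying file system:

    import errno
    import fcntl
    import os

    def fcntl_lock_works(path):
        """Try to take a POSIX advisory (fcntl) write lock on a file
        on the file system under test."""
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
            return True
        except OSError:
            # File systems without advisory locking typically fail
            # here, often with ENOLCK or EINVAL.
            return False
        finally:
            os.close(fd)

    def take_dotlock(path):
        """Whoever creates "<path>.lock" first wins; O_CREAT|O_EXCL
        makes the creation atomic on a POSIX file system."""
        try:
            fd = os.open(path + ".lock",
                         os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
            os.close(fd)
            return True
        except OSError as e:
            if e.errno == errno.EEXIST:
                return False  # someone else holds the lock
            raise

    def drop_dotlock(path):
        os.unlink(path + ".lock")

    # Example probe (mount point invented):
    # print(fcntl_lock_works("/mnt/pvfs2/.locktest"))

If the fcntl probe fails, you are in master/slave failover territory
for anything that insists on advisory locks; dotlocks may still work,
since they only need an atomic create.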
> [...]
> - Lustre (http://www.lustre.org/) - Seems to be the focus of the
> commercial world. Currently based on ext3/ext4, to be based on ZFS in
> 2010. A weakness seems to be having a single shared metadata server
> that must be highly available using a shared disk solution such as GFS
> or OCFS. Due to this architecture, I do not consider this solution to
> meet our requirements of a shared nothing architecture where any server
> can completely die and the other servers take over the load without
> intervention.
Lustre is dependent upon Sun, and there are, to put it mildly, concerns
over its future within Oracle. Oracle isn't really in the high
performance computing market, which is where Lustre plays. I won't go
into more depth here on its future.
Lustre is predominantly an object based storage system. It depends
critically upon features that require very specific kernels and kernel
patches, which tends to make it incompatible with any requirement to
stay on stock distro kernels.
The migration to ZFS has been seen in some circles (people have
mentioned this to us) as a migration over to Solaris, which has caused
numerous users to start looking at transition plans off of Lustre.
Which is hard when you have petabytes of data ... moving it ain't easy.
> - CRFS (http://oss.oracle.com/projects/crfs/) - Btrfs based - Btrfs
> is Oracle's answer to ZFS, and CRFS is Oracle's answer to Lustre,
> although development of this solution seems slow and this system is
> not ready for production. Development of both has effectively stalled
> since 2008. If these are ever released, I think they will be great
> solutions, but they are apparently having design problems (either
> developers who are not good enough, or the design is too complicated,
> probably both).
BTRFS has most definitely not stalled. It is now in the Linux kernel as
of 2.6.29, and is the target file system for a number of well known
distros going forward. Ext4 is simply not viable for the storage sizes
people are contemplating. XFS, a venerable file system, has most of its
developers at SGI, which carries obvious risks. JFS may not be actively
developed anymore. Chris Mason has been very actively doing btrfs work,
as far as I can tell from the various sources:
http://btrfs.wiki.kernel.org/index.php/Main_Page#News
CRFS is dependent upon BTRFS, so CRFS is more of a placeholder.
With Sun owning ZFS and Oracle owning BTRFS, given that the latter is
GPL licensed and the former is not (and is patent encumbered), I expect
more work on BTRFS going forward for Linux, an important platform for
Oracle. Solaris is not increasing in installed base, rather it is
rapidly doing the opposite, and this trend isn't likely lost on Oracle.
Of course, we could be wrong, and our biases are in part due to what we
sell, resell, and support, so take what I say with a grain of salt if
you wish.
I do expect GlusterFS to work well atop BTRFS in the not so distant future.
You did neglect pNFS in your notes. It's sort of the "pink elephant" in
the room. There are good things about it, and some ... er ...
challenging things about it. I expect the Kerberos requirements (and
all that this implies) aren't going to help its adoption. If you
haven't dealt with a Kerberos installation and management situation,
you might not get this.
Also, pohmelfs was included in Linux 2.6.29. This is an interesting
parallel file system, but we haven't played with it much yet.
Finally, among the other file systems you should pay attention to,
nilfs2 looks quite interesting. Continuous snapshotting is a compelling
feature, though how it could be used from within GlusterFS (GlusterFS
atop nilfs2) isn't completely apparent yet. It could make for some very
powerful capability in GlusterFS if the developers go this route.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@xxxxxxxxxxxxxxxxxxxxxxx
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615