Hello there gluster developers and users,

I'm trying to get a handle on what it takes to get glusterfs to work reliably. After several weeks of testing we have not yet been able to get it to run stably in our setup, and I'm beginning to wonder if there is a statistical approach to finding out what works and what doesn't, rather than trying to go about it one bug at a time. The goal is a 'compatibility' list of sorts: things you should do if you intend to run glusterfs in production, plus a percentage-wise idea of how many installations run smoothly and how many people are experiencing issues.

One of the nastier pitfalls we ran into is that the build does not warn when it builds a kernel module that will never load, because fuse is already linked into the kernel. This means that even though you think you are running an updated fuse, you may in fact be using the one that was linked in with the kernel. This is easy to check by running lsmod: if it does not show the fuse module, you are using the in-kernel one.

This, combined with the various reports of success here, makes me want to ask the group on this list the following:

- what is your hardware configuration?
- what is your glusterfs configuration?
- have you experienced problems building / installing / using the system?

And most interesting of all: if you have not experienced any difficulty, is there a single point that you think sets your setup apart from the ones that fail (or a common element between yours and the ones that don't)? This might help to compile a checklist of 'must have' conditions for running glusterfs stably in a production environment. It will also give me a feeling for whether glusterfs not working right 'out of the box' is the rule rather than the exception.
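The lsmod check mentioned above is easy to script; here is a minimal sketch (the helper name fuse_status is mine, not part of any tool):

```shell
# Minimal sketch of the check described above: if 'fuse'
# appears at the start of an lsmod output line, the loadable
# module is in use; if not, you are probably running the fuse
# code built into the kernel. fuse_status is a hypothetical
# helper that takes lsmod output as its argument.
fuse_status() {
    if echo "$1" | grep -q '^fuse'; then
        echo "fuse module loaded"
    else
        echo "fuse module NOT loaded - likely built into the kernel"
    fi
}

# On a live system you would call: fuse_status "$(lsmod)"
fuse_status "fuse  86016  3"   # sample lsmod line
```

If the second branch fires even though you installed a patched fuse, the module you built is being shadowed by the in-kernel one.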
For starters, here is my setup:

Hardware: 5 node cluster, dual Opterons, 8G RAM per box, Supermicro chassis, 200G SATA drives. 100 Mb link to the net, GigE backchannel between the nodes. The machines run Debian 'etch' 64-bit Linux, kernel version 2.6.17.11. Fuse has been upgraded to the glfs4 patch.

Glusterfs configuration: readahead / writebehind / unify over all nodes in the cluster (currently only 4, because one machine developed a hardware problem).

Initially we ran version 1.3.1, but with a lot of problems; these were later traced to the fuse module not loading (see above). After that we upgraded to the current tla release (504). We've had one issue of a glusterfs client process looping, and we're trying to track that bug as well as another one that caused some problems while running tests.

The tests we are running are 'dbench' runs with lengths of up to 3 days; sometimes it fails quickly and sometimes it takes longer. dbench simulates a number of users on a LAN accessing a file server, and it has so far served well as a way to bring issues to the surface.

Current status: not stable enough for any production work, but incrementally improving stability, with some setbacks.

best regards,

Jacques Mattheij
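PS: for reference, here is a rough sketch of the kind of client volume file our readahead / writebehind / unify setup describes. Hostnames, volume names and the exact option set are placeholders from memory, not our actual config; in this release unify also wants a namespace brick, sketched here as brick-ns. Check the translator docs before copying anything from this.

```
volume brick1
  type protocol/client
  option transport-type tcp/client
  option remote-host node1
  option remote-subvolume brick
end-volume

# brick2 .. brick4 are defined the same way, and brick-ns
# points at a small namespace volume on one of the servers.

volume unify0
  type cluster/unify
  subvolumes brick1 brick2 brick3 brick4
  option namespace brick-ns
  option scheduler rr
end-volume

volume wb
  type performance/write-behind
  subvolumes unify0
end-volume

volume ra
  type performance/read-ahead
  subvolumes wb
end-volume
```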