Hello,
*Q1:*
I've installed glusterfs-1.3.8pre1 on one node of a cluster running
1.3.7, but the 1.3.8pre1 glusterfsd drops incoming connections from
1.3.7 clients. Is this by design? Do I need to upgrade everything to
1.3.8pre1 at once?
Here is part of /var/log/glusterfs/glusterfsd.log:
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection
disconnected
2008-02-22 17:17:33 E [server-protocol.c:183:generic_reply] server:
transport_writev failed
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection
disconnected
2008-02-22 17:17:33 E [server-protocol.c:183:generic_reply] server:
transport_writev failed
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection
disconnected
2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: connection
disconnected
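To show how often each message repeats, the log can be tallied by level and function. This is only a minimal sketch: the regular expression is inferred from the excerpt above and is not an official description of the glusterfs log format, and the names ENTRY and tally are my own.

```python
import re
from collections import Counter

# Start of a log entry as it appears in the excerpt, e.g.
# "2008-02-22 17:17:33 C [tcp.c:87:tcp_disconnect] server: ..."
# ASSUMPTION: pattern inferred from the excerpt only.
ENTRY = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]) \[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
)


def tally(lines):
    """Count log entries per (level, function).

    Lines that do not start a new entry (for example wrapped
    continuation lines) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if match:
            counts[(match.group("level"), match.group("func"))] += 1
    return counts
```

It can be run over the whole file with something like tally(open("/var/log/glusterfs/glusterfsd.log")), and most_common() on the result lists the noisiest messages first.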
*Q2:*
The reasons for trying to upgrade to 1.3.8 are the following:
Current configuration:
- 16 clients / 16 servers (one client and one server on each machine)
- servers are dual Opteron, some of them quad-core, with 8 or 12 GB RAM
- kernel 2.6.24-2, Gentoo Linux (can provide gluster ebuilds)
- fuse 2.7.2glfs8, glusterfs 1.3.7 (see config files); basically a
simple unify with no ra/wt cache
Configs are here: http://gluster.pastebin.com/m7f61927f
All servers are stable, and the problems below occur under normal
running conditions.
Inside the gluster filesystem we store ~3 million pictures in a
directory tree that guarantees at most 1k pictures or subdirectories
per directory, with ~30 writes per second and ~300 reads per second.
Files are relatively small, 4-5k per picture.
1. glusterfs (client) appears to leak memory in our configuration:
300 MB of RAM eaten over 2 days.
2. frequent files with size 0 and ctime 0 (1970), even when all servers
are up and running.
3. occasional files with correct size/ctime that cannot be read;
sometimes they can be read from other servers.
4. back when I was using AFR for a mirrored namespace (which I gave up
while trying to alleviate the other errors), glusterfs (client) crashed
in AFR when one of the servers was shutting down.
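For problem 2, the affected files can be listed with a scan along these lines. It is a minimal sketch, assuming the symptom is visible through a plain lstat() on the mounted filesystem; find_suspect_files and is_suspect are names I made up, and walking ~3 million files through the mount will itself put load on the cluster.

```python
import os
import stat


def is_suspect(size, ctime):
    """True for the symptom in item 2: size 0 and ctime at the epoch (1970)."""
    return size == 0 and int(ctime) == 0


def find_suspect_files(root):
    """Yield regular files under root that have size 0 and ctime 0."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # Unreadable entries (as in item 3) are skipped here.
                continue
            if stat.S_ISREG(st.st_mode) and is_suspect(st.st_size, st.st_ctime):
                yield path
```

Running it against one subdirectory at a time, for example list(find_suspect_files("/path/to/mount/subdir")) with a hypothetical mount path, keeps each pass short.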
These errors appear in glusterfs.log when a file cannot be read:
2008-02-22 17:37:36 E [unify.c:837:unify_open]
nowb-nora-client-stable-www: /tmpfs/small/1/70/92/7092182.jpg:
entry_count is 4
2008-02-22 17:34:51 E [unify.c:790:unify_open_cbk]
nowb-nora-client-stable-www: Open success on namespace, failed on child node
*Q3:*
Nice-to-haves:
1. Redundant namespace => no single point of failure.
2. A way to see a diagram of the cluster, its connected nodes, etc.,
pulling live data from one of the servers/clients. This would help
anyone running more than 2-3 servers.
As a side question, our organization could commit a part-time developer
dedicated to helping out with glusterfs; are you interested?
Best regards
Dan