Gluster 3.2.1: Mounted volume "vanishes" on client side

Hi!

I am using Gluster 3.2.1 on a two/three-node openSUSE 11.3/11.4 server
cluster, where each Gluster node acts as both server and client.

While migrating the cluster to servers with higher performance, I also
tried the Gluster 3.3 beta.

Both versions show the same problem:

A single volume (holding the mail store, accessed by the POP3, IMAP and
SMTP servers) reports an "Input/output error" shortly after being
mounted and becomes inaccessible. The same volume mounted on another,
idle server still works.

ls /var/vmail
ls: cannot access /var/vmail: Input/output error

lsof /var/vmail
lsof: WARNING: can't stat() fuse.glusterfs file system /var/vmail
      Output information may be incomplete.
lsof: status error on /var/vmail: Input/output error

After unmounting and remounting the volume, the same thing happens.
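
In case it helps with diagnosis, this is roughly how I look at the FUSE
client log right after the error appears (a sketch assuming the default
log location; the file name is derived from the mount point, so adjust
it if your packages log elsewhere):

# client log for the /var/vmail mount
tail -n 50 /var/log/glusterfs/var-vmail.log

# show only error-level ("E") entries
grep ' E ' /var/log/glusterfs/var-vmail.log | tail -n 20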

I tried to recreate the volume, but this does not help.

Although the volume was only just created, the log is full of "self
healing" entries (but those should not cause the volume to disappear,
right?).
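
Since an "Input/output error" on a replicate volume can also come from a
split-brain file, I check the AFR changelog xattrs directly on the brick
backends. This is only a sketch: <PATH> is a placeholder for a file that
returns the error on the client, and the exact trusted.afr.* attribute
names depend on the volume/client names:

# run as root on each brick server (mx01 and mx02)
getfattr -d -m trusted.afr -e hex /data/vmail/<PATH>

Non-zero pending counters for the same file on both bricks, each blaming
the other, would point to split-brain, which does produce EIO on the
client.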

I initially tried it with three bricks (and had to remove one) and the
following parameters; a rough reconstruction of the CLI commands follows
the listing:

Volume Name: vmail
Type: Replicate
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: mx00.akxnet.de:/data/vmail
Brick2: mx02.akxnet.de:/data/vmail
Brick3: mx01.akxnet.de:/data/vmail
Options Reconfigured:
network.ping-timeout: 15
performance.write-behind-window-size: 2097152
auth.allow: xx.xx.xx.xx,yy.yy.yy.yy,zz.zz.zz.zz,127.0.0.1
performance.io-thread-count: 64
performance.io-cache: on
performance.stat-prefetch: on
performance.quick-read: off
nfs.disable: on
performance.cache-size: 32MB (I also tried 64MB)
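
For completeness, the volume was set up with commands along these lines
(reconstructed from the info output above, not a verbatim transcript;
the remaining performance.* options were set the same way):

gluster volume create vmail replica 3 transport tcp \
    mx00.akxnet.de:/data/vmail mx02.akxnet.de:/data/vmail mx01.akxnet.de:/data/vmail
gluster volume set vmail network.ping-timeout 15
gluster volume set vmail performance.quick-read off
gluster volume set vmail nfs.disable on
gluster volume start vmail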

and, after the delete/recreate, with two bricks and the following parameters:

Volume Name: vmail
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mx02.akxnet.de:/data/vmail
Brick2: mx01.akxnet.de:/data/vmail
Options Reconfigured:
performance.quick-read: off
nfs.disable: on
auth.allow: xx.xx.xx.xx,yy.yy.yy.yy,zz.zz.zz.zz,127.0.0.1

But always the same result.
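
To capture more detail the next time it happens, I intend to remount the
volume with a higher log level; as far as I know mount.glusterfs accepts
log-level and log-file options (the log file path below is just my own
choice):

umount /var/vmail
mount -t glusterfs \
    -o log-level=DEBUG,log-file=/var/log/glusterfs/vmail-debug.log \
    mx01.akxnet.de:/vmail /var/vmail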

The log shows entries like these:

[2011-08-30 22:10:45.376568] I
[afr-self-heal-common.c:1557:afr_self_heal_completion_cbk]
0-vmail-replicate-0: background  data data self-heal completed on
/xxxxx.de/yyyyyyyyyy/.Tauchen/courierimapuiddb
[2011-08-30 22:10:45.385541] I [afr-common.c:801:afr_lookup_done]
0-vmail-replicate-0: background  meta-data self-heal triggered. path:
/xxxxx.de/yyyyyyyyy/.Tauchen/courierimapkeywords
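
To see whether the two replicas of such a file actually differ, I compare
them directly on the brick backends (again, <PATH> stands in for the
masked path from the messages above):

# run on mx01 and on mx02
stat /data/vmail/<PATH>
md5sum /data/vmail/<PATH>

If size, mtime and checksum match on both bricks, the constant self-heal
messages would be even more puzzling.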

The volume is presently unusable. Any hints?

