Back in March I posted about some gluster problems:

http://gluster.org/pipermail/gluster-users/2013-March/035737.html
http://gluster.org/pipermail/gluster-users/2013-March/035655.html

I'm still in the same situation - a straightforward 2-node, 2-way AFR setup, with each server mounting the single shared volume via NFS, running gluster 3.3.0 (I can't use 3.3.1 due to its NFS issues) on 64-bit Linux (Ubuntu Lucid). Gluster appears to be working, but the volume won't mount on boot by any means I've tried (the kind of mount I'm after is sketched further down), and it's still logging prodigious amounts of incomprehensible (to me!) rubbish.

gluster itself says everything is OK:

    gluster volume status

    Status of volume: shared
    Gluster process                       Port    Online  Pid
    ----------------------------------------------------------
    Brick 192.168.1.10:/var/shared        24009   Y       3097
    Brick 192.168.1.11:/var/shared        24009   Y       3020
    NFS Server on localhost               38467   Y       3103
    Self-heal Daemon on localhost         N/A     Y       3109
    NFS Server on 192.168.1.11            38467   Y       3057
    Self-heal Daemon on 192.168.1.11      N/A     Y       3096

(The other node reports the same thing, with the IPs the other way around.)

Yet the logs tell a different story. In syslog, this happens every second:

    Jul  3 00:17:29 web1 init: glusterd main process (14958) terminated with status 255
    Jul  3 00:17:29 web1 init: glusterd main process ended, respawning

In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log I have lots of this:

    [2013-07-03 14:24:08.350429] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
    [2013-07-03 14:24:08.350592] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)

In /var/log/glusterfs/glustershd.log, every minute I get hundreds of these:

    [2013-07-03 14:24:00.792751] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:16adce4d-1933-485f-8359-66c47c757cd3>
    [2013-07-03 14:24:00.794251] I [afr-common.c:1340:afr_launch_self_heal] 0-shared-replicate-0: background meta-data self-heal triggered. path: <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>, reason: lookup detected pending operations
    [2013-07-03 14:24:00.796411] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>

'gluster volume heal shared info' says:

    Heal operation on volume shared has been successful

    Brick 192.168.1.10:/var/shared
    Number of entries: 335
    ...

I'm not clear whether that means it still has 335 files to fix, or whether it has already fixed them. Both servers log the same kind of thing.

I'm sure all of these are related, since they happen at about the same rate. The lock error looks the most interesting, but I've no idea why it should happen. As before, I've tried deleting all traces of gluster, reinstalling, reconfiguring, and putting all the data back on, but nothing changes.
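For what it's worth, "Resource temporarily unavailable" is EAGAIN, which on a pidfile lock normally just means some other process already holds it. That would fit if two glusterd instances were competing for /var/run/glusterd.pid - say, upstart respawning a daemonized glusterd it thinks has died, which would also line up with the respawn messages in syslog. Something like this should show who actually holds the file (standard tools, nothing gluster-specific; the pidfile path is taken from the log above):

    # which process has the pidfile open (and so presumably holds the lock)
    sudo lsof /var/run/glusterd.pid

    # how many glusterd processes are actually alive
    ps aux | grep '[g]lusterd'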
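On the 335 entries: I believe 3.3 added sub-modes to the heal command that split pending entries from completed and failed ones, which ought to answer whether it's still working or already done. The syntax as I understand it (I haven't verified these against 3.3.0 yet):

    gluster volume heal shared info healed
    gluster volume heal shared info heal-failed
    gluster volume heal shared info split-brain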
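And since I mentioned the boot-time mount, this is roughly the kind of fstab entry I'm trying to get working - the mount point here is just an example, and the options are approximate:

    # NFS-mount the local gluster volume; vers=3 because gluster's NFS
    # server only speaks NFSv3, nolock to avoid depending on NLM at boot
    localhost:/shared  /mnt/shared  nfs  defaults,_netdev,vers=3,tcp,nolock  0  0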
Here's the command I used to create the volume:

    gluster volume create shared replica 2 transport tcp 192.168.1.10:/var/shared 192.168.1.11:/var/shared

And here's the volume file it created:

    volume shared-posix
        type storage/posix
        option directory /var/shared
        option volume-id 2600e26c-b6c4-448f-a6f6-ad27c14745a0
    end-volume

    volume shared-access-control
        type features/access-control
        subvolumes shared-posix
    end-volume

    volume shared-locks
        type features/locks
        subvolumes shared-access-control
    end-volume

    volume shared-io-threads
        type performance/io-threads
        subvolumes shared-locks
    end-volume

    volume shared-index
        type features/index
        option index-base /var/shared/.glusterfs/indices
        subvolumes shared-io-threads
    end-volume

    volume shared-marker
        type features/marker
        option volume-uuid 2600e26c-b6c4-448f-a6f6-ad27c14745a0
        option timestamp-file /var/lib/glusterd/vols/shared/marker.tstamp
        option xtime off
        option quota off
        subvolumes shared-index
    end-volume

    volume /var/shared
        type debug/io-stats
        option latency-measurement off
        option count-fop-hits off
        subvolumes shared-marker
    end-volume

    volume shared-server
        type protocol/server
        option transport-type tcp
        option auth.login./var/shared.allow 94017411-d986-48e4-a7ac-47c1db14fba0
        option auth.login.94017411-d986-48e4-a7ac-47c1db14fba0.password 3929acf9-fcf1-4684-b271-07927d375c9b
        option auth.addr./var/shared.allow *
        subvolumes /var/shared
    end-volume

Despite all this, I've not seen gluster do anything visibly wrong: if I create a file on the shared volume it appears on the other node, checksums match, and clients can read it. But I don't want to be running on luck! It's all very troubling, and it's making a right mess of my new distributed logging system...

Any ideas?

Marcus