Back in March I posted about some gluster problems:

http://gluster.org/pipermail/gluster-users/2013-March/035737.html
http://gluster.org/pipermail/gluster-users/2013-March/035655.html

I'm still in the same situation - a straightforward 2-node, 2-way AFR setup, with each server mounting the single shared volume via NFS, running gluster 3.3.0 (I can't use 3.3.1 due to its NFS issues) on 64-bit Linux (Ubuntu Lucid). Gluster appears to be working, but the volume won't mount on boot by any means I've tried (the kind of mount I'm after is sketched further down), and it's still logging prodigious amounts of incomprehensible (to me!) rubbish.

gluster itself says everything is OK:

    gluster volume status

    Status of volume: shared
    Gluster process                       Port    Online  Pid
    ----------------------------------------------------------
    Brick 192.168.1.10:/var/shared        24009   Y       3097
    Brick 192.168.1.11:/var/shared        24009   Y       3020
    NFS Server on localhost               38467   Y       3103
    Self-heal Daemon on localhost         N/A     Y       3109
    NFS Server on 192.168.1.11            38467   Y       3057
    Self-heal Daemon on 192.168.1.11      N/A     Y       3096

(The other node reports the same thing, with the IPs the other way around.)

Yet the logs tell a different story. In syslog, this happens every second:

    Jul  3 00:17:29 web1 init: glusterd main process (14958) terminated with status 255
    Jul  3 00:17:29 web1 init: glusterd main process ended, respawning

In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log I have lots of this:

    [2013-07-03 14:24:08.350429] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
    [2013-07-03 14:24:08.350592] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)

In /var/log/glusterfs/glustershd.log, every minute I get hundreds of these:

    [2013-07-03 14:24:00.792751] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:16adce4d-1933-485f-8359-66c47c757cd3>
    [2013-07-03 14:24:00.794251] I [afr-common.c:1340:afr_launch_self_heal] 0-shared-replicate-0: background meta-data self-heal triggered. path: <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>, reason: lookup detected pending operations
    [2013-07-03 14:24:00.796411] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>

'gluster volume heal shared info' says:

    Heal operation on volume shared has been successful

    Brick 192.168.1.10:/var/shared
    Number of entries: 335
    ...

I'm not clear whether that means it still has 335 files to fix, or whether it has already fixed them. Both servers log the same kind of thing.

I'm sure all of these are related, since they happen at about the same rate. The lock error looks the most interesting, but I've no idea why it should happen. As before, I've tried deleting all traces of gluster, reinstalling, reconfiguring, and putting all the data back on, but nothing changes.
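For what it's worth, "Resource temporarily unavailable" is EAGAIN, which on a pidfile lock normally just means some other process already holds it. That would fit if two glusterd instances were competing for /var/run/glusterd.pid - say, upstart respawning a daemonized glusterd it thinks has died, which would also line up with the respawn messages in syslog. Something like this should show who actually holds the file (standard tools, nothing gluster-specific; the pidfile path is taken from the log above):

    # which process has the pidfile open (and so presumably holds the lock)
    sudo lsof /var/run/glusterd.pid

    # how many glusterd processes are actually alive
    ps aux | grep '[g]lusterd'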
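On the 335 entries: I believe 3.3 added sub-modes to the heal command that split pending entries from completed and failed ones, which ought to answer whether it's still working or already done. The syntax as I understand it (I haven't verified these against 3.3.0 yet):

    gluster volume heal shared info healed
    gluster volume heal shared info heal-failed
    gluster volume heal shared info split-brain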
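And since I mentioned the boot-time mount, this is roughly the kind of fstab entry I'm trying to get working - the mount point here is just an example, and the options are approximate:

    # NFS-mount the local gluster volume; vers=3 because gluster's NFS
    # server only speaks NFSv3, nolock to avoid depending on NLM at boot
    localhost:/shared  /mnt/shared  nfs  defaults,_netdev,vers=3,tcp,nolock  0  0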
Here's the command I used to create the volume:

    gluster volume create shared replica 2 transport tcp 192.168.1.10:/var/shared 192.168.1.11:/var/shared

And here's the volume file it created:

    volume shared-posix
        type storage/posix
        option directory /var/shared
        option volume-id 2600e26c-b6c4-448f-a6f6-ad27c14745a0
    end-volume

    volume shared-access-control
        type features/access-control
        subvolumes shared-posix
    end-volume

    volume shared-locks
        type features/locks
        subvolumes shared-access-control
    end-volume

    volume shared-io-threads
        type performance/io-threads
        subvolumes shared-locks
    end-volume

    volume shared-index
        type features/index
        option index-base /var/shared/.glusterfs/indices
        subvolumes shared-io-threads
    end-volume

    volume shared-marker
        type features/marker
        option volume-uuid 2600e26c-b6c4-448f-a6f6-ad27c14745a0
        option timestamp-file /var/lib/glusterd/vols/shared/marker.tstamp
        option xtime off
        option quota off
        subvolumes shared-index
    end-volume

    volume /var/shared
        type debug/io-stats
        option latency-measurement off
        option count-fop-hits off
        subvolumes shared-marker
    end-volume

    volume shared-server
        type protocol/server
        option transport-type tcp
        option auth.login./var/shared.allow 94017411-d986-48e4-a7ac-47c1db14fba0
        option auth.login.94017411-d986-48e4-a7ac-47c1db14fba0.password 3929acf9-fcf1-4684-b271-07927d375c9b
        option auth.addr./var/shared.allow *
        subvolumes /var/shared
    end-volume

Despite all this, I've not seen gluster do anything visibly wrong: if I create a file on the shared volume it appears on the other node, checksums match, and clients can read it. But I don't want to be running on luck! It's all very troubling, and it's making a right mess of my new distributed logging system...

Any ideas?

Marcus