On Wed, Jul 3, 2013 at 8:25 PM, Marcus Bointon <marcus at synchromedia.co.uk> wrote:
> Back in March I posted about some gluster problems:
>
> http://gluster.org/pipermail/gluster-users/2013-March/035737.html
> http://gluster.org/pipermail/gluster-users/2013-March/035655.html
>
> I'm still in the same situation - a straightforward 2-node, 2-way AFR setup with each server mounting the single shared volume via NFS using gluster 3.3.0 (can't use 3.3.1 due to its NFS issues) on 64-bit linux (ubuntu lucid). Gluster appears to be working, but won't mount on boot by any means I've tried, and it's still logging prodigious amounts of incomprehensible rubbish (to me!).
>
> gluster says everything is ok:
>
> gluster volume status
> Status of volume: shared
> Gluster process                         Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.1.10:/var/shared          24009   Y       3097
> Brick 192.168.1.11:/var/shared          24009   Y       3020
> NFS Server on localhost                 38467   Y       3103
> Self-heal Daemon on localhost           N/A     Y       3109
> NFS Server on 192.168.1.11              38467   Y       3057
> Self-heal Daemon on 192.168.1.11        N/A     Y       3096
>
> (other node says the same thing with IPs the other way around)
>
> Yet the logs tell a different story.
>
> In syslog, this happens every second:
>
> Jul 3 00:17:29 web1 init: glusterd main process (14958) terminated with status 255
> Jul 3 00:17:29 web1 init: glusterd main process ended, respawning

It looks like the init system keeps trying to restart glusterd. glusterd is a daemon: the process that init launches spawns the real daemon and then exits. Init is probably treating the exit of that launcher process as glusterd itself dying, and so tries to start it again. Since glusterd is already running, each new attempt fails, which is why you are getting the logs below. What packages are you using, and what distro is this on?
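
As a rough illustration of where the "Resource temporarily unavailable" text in the glusterd log comes from: that string is the message for EAGAIN, which a non-blocking lock attempt returns when another process already holds the lock. The sketch below is Python, not glusterd's own code, and it assumes the pidfile is protected with a POSIX advisory lock. The helper name pidfile_is_locked is made up for this example.

```python
import errno
import fcntl


def pidfile_is_locked(path):
    """Return True if another process holds a POSIX lock on `path`.

    Hypothetical helper for illustration only; it assumes the daemon
    guards its pidfile with an advisory lock.
    """
    # An exclusive lockf() needs the file opened for writing.
    with open(path, "r+") as fp:
        try:
            # Non-blocking attempt: fails at once if the lock is taken.
            fcntl.lockf(fp, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # EAGAIN is reported as "Resource temporarily unavailable".
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        # We got the lock, so nobody else had it; release it again.
        fcntl.lockf(fp, fcntl.LOCK_UN)
        return False
```

Run as root, pidfile_is_locked("/var/run/glusterd.pid") returning True would be consistent with a glusterd instance already running and holding the file, so a second copy started by init cannot take the lock and exits.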
> In /var/log/glusterfs/etc-glusterfs-glusterd.vol.log I have lots of this:
>
> [2013-07-03 14:24:08.350429] I [glusterfsd.c:1666:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.3.0
> [2013-07-03 14:24:08.350592] E [glusterfsd.c:1296:glusterfs_pidfile_setup] 0-glusterfsd: pidfile /var/run/glusterd.pid lock error (Resource temporarily unavailable)
>
> In /var/log/glusterfs/glustershd.log, every minute I get hundreds of these:
>
> [2013-07-03 14:24:00.792751] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:16adce4d-1933-485f-8359-66c47c757cd3>
> [2013-07-03 14:24:00.794251] I [afr-common.c:1340:afr_launch_self_heal] 0-shared-replicate-0: background meta-data self-heal triggered. path: <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>, reason: lookup detected pending operations
> [2013-07-03 14:24:00.796411] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-shared-replicate-0: background meta-data self-heal completed on <gfid:52bd33a0-0df8-408a-bf6f-c5b6a48c4bd3>
>
> 'gluster volume heal shared info' says:
>
> Heal operation on volume shared has been successful
>
> Brick 192.168.1.10:/var/shared
> Number of entries: 335
> ...
>
> I'm not clear whether this means it has 335 files still to fix, or whether it's done so already.

This means there are 335 files still to fix.

> Both servers are logging the same kind of stuff. I'm sure all these are related since they happen at about the same rate.
>
> The lock error looks the most interesting, but I've no idea why that should happen. As before, I've tried deleting all traces of gluster, reinstalling and reconfiguring and putting all the data back on, but nothing changes.
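
On the heal count: one way to see whether that backlog is actually shrinking is to record the "Number of entries" figure per brick over time. Below is a minimal Python sketch that pulls those figures out of text shaped like the 'gluster volume heal shared info' output quoted above. It only assumes the "Brick ..." and "Number of entries: N" lines shown there; the function name pending_heal_entries is invented for this example, and other gluster versions may print something different.

```python
import re


def pending_heal_entries(output):
    """Map each brick to its 'Number of entries' count.

    Hypothetical helper; `output` is the captured text of
    'gluster volume heal <volume> info' in the format quoted above.
    """
    counts = {}
    brick = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Remember which brick the following count belongs to.
            brick = line[len("Brick "):]
            continue
        match = re.match(r"Number of entries:\s*(\d+)$", line)
        if match and brick is not None:
            counts[brick] = int(match.group(1))
    return counts
```

Feeding it the output from successive runs and comparing the numbers should show whether the self-heal daemon is making progress or the same entries keep being re-queued.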
>
> Here's the command I used to create the volume:
>
> gluster volume create shared replica 2 transport tcp 192.168.1.10:/var/shared 192.168.1.11:/var/shared
>
> Here's the volume file it created:
>
> +------------------------------------------------------------------------------+
> volume shared-posix
>     type storage/posix
>     option directory /var/shared
>     option volume-id 2600e26c-b6c4-448f-a6f6-ad27c14745a0
> end-volume
>
> volume shared-access-control
>     type features/access-control
>     subvolumes shared-posix
> end-volume
>
> volume shared-locks
>     type features/locks
>     subvolumes shared-access-control
> end-volume
>
> volume shared-io-threads
>     type performance/io-threads
>     subvolumes shared-locks
> end-volume
>
> volume shared-index
>     type features/index
>     option index-base /var/shared/.glusterfs/indices
>     subvolumes shared-io-threads
> end-volume
>
> volume shared-marker
>     type features/marker
>     option volume-uuid 2600e26c-b6c4-448f-a6f6-ad27c14745a0
>     option timestamp-file /var/lib/glusterd/vols/shared/marker.tstamp
>     option xtime off
>     option quota off
>     subvolumes shared-index
> end-volume
>
> volume /var/shared
>     type debug/io-stats
>     option latency-measurement off
>     option count-fop-hits off
>     subvolumes shared-marker
> end-volume
>
> volume shared-server
>     type protocol/server
>     option transport-type tcp
>     option auth.login./var/shared.allow 94017411-d986-48e4-a7ac-47c1db14fba0
>     option auth.login.94017411-d986-48e4-a7ac-47c1db14fba0.password 3929acf9-fcf1-4684-b271-07927d375c9b
>     option auth.addr./var/shared.allow *
>     subvolumes /var/shared
> end-volume
>
> Despite all this, I've not seen gluster do anything visibly wrong - if I create a file on the shared volume it appears on the other node, checksums match, clients can read etc, but I don't want to be running on luck! It's all very troubling, and it's making a right mess of my new distributed logging system...
>
> Any ideas?
>
> Marcus
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users