One client log file is here: http://goo.gl/FyYfy

On the server side, on bs1 & bs4, there is a huge, current nfs.log file
(odd, since I neither wanted nor configured an NFS export). It is filled
entirely with these lines:

tail -5 nfs.log
[2012-06-19 21:11:54.402567] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-1: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.406023] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.409486] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-3: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.412822] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-6: tcp connect to 10.2.7.11:24008 failed (Connection refused)
[2012-06-19 21:11:54.416231] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-7: tcp connect to 10.2.7.11:24008 failed (Connection refused)

On servers bs2 & bs3 there is a current, huge log of this line, repeating
every 3s:

[2012-06-19 21:14:00.907387] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

I was reminded as I was copying it that the client and servers are at
slightly different versions - the client is "3.3.0qa42-1" while the
servers are "3.3.0-1". Is this enough version skew to cause problems?
There are no other problems that I'm aware of, but if even a slight
version skew is problematic, I'll be careful to keep them exactly
aligned. I think this was done because the final release binary did not
support the glibc that we were using on the compute nodes, while the
3.3.0qa42-1 build did. Perhaps too sloppy...?
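For what it's worth, the skew is easy to spot-check from a single node -
a minimal sketch, assuming passwordless ssh to the servers ('glusterfs
--version' prints the version string on both clients and servers):

for h in bs1 bs2 bs3 bs4; do
    echo -n "$h: "; ssh $h 'glusterfs --version | head -1'
done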
gluster volume info

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

gluster volume status

Status of volume: gl
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick bs2:/raid1                               24009   Y       2908
Brick bs2:/raid2                               24011   Y       2914
Brick bs3:/raid1                               24009   Y       2860
Brick bs3:/raid2                               24011   Y       2866
Brick bs4:/raid1                               24009   Y       2992
Brick bs4:/raid2                               24011   Y       2998
Brick bs1:/raid1                               24013   Y       10122
Brick bs1:/raid2                               24015   Y       10154
NFS Server on localhost                        38467   Y       9475
NFS Server on 10.2.7.11                        38467   Y       10160
NFS Server on bs2                              38467   N       N/A
NFS Server on bs3                              38467   N       N/A

Hmm - sure enough, bs1 and bs4 (localhost in the above info) appear to
be running NFS servers, while bs2 & bs3 are not...?

OK - after some googling, the gluster NFS service can be shut off with:

gluster volume set gl nfs.disable on
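If I understand the CLI correctly, the change can be confirmed with
'gluster volume info' (reconfigured options are listed at the bottom)
and reverted later if need be:

gluster volume info gl | grep nfs
# should now report: nfs.disable: on
gluster volume set gl nfs.disable off    # turns the gluster NFS server back on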
And now the status looks like this:

gluster volume status

Status of volume: gl
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick bs2:/raid1                               24009   Y       2908
Brick bs2:/raid2                               24011   Y       2914
Brick bs3:/raid1                               24009   Y       2860
Brick bs3:/raid2                               24011   Y       2866
Brick bs4:/raid1                               24009   Y       2992
Brick bs4:/raid2                               24011   Y       2998
Brick bs1:/raid1                               24013   Y       10122
Brick bs1:/raid2                               24015   Y       10154

hjm

On Tue, 2012-06-19 at 13:05 -0700, Anand Avati wrote:
> Can you post the complete logs? Is the 'Too many levels of symbolic
> links' (or ELOOP) logs seen in the client log or brick logs?
>
> Avati
>
> On Tue, Jun 19, 2012 at 11:22 AM, harry mangalam
> <hjmangalam at gmail.com> wrote:
>
> (Apologies if this already posted, but I recently had to change smtp
> servers, which scrambled some list permissions, and I haven't seen it
> post.)
>
> I set up a 3.3 gluster volume for another sysadmin and he has added it
> to his cluster via automount. It seems to work initially, but after
> some time (days) he is now regularly seeing this warning:
>
> "Too many levels of symbolic links"
>
> when he tries to traverse the mounted filesystems.
>
> $ df: `/share/gl': Too many levels of symbolic links
>
> It's supposed to be mounted on /share/gl with a symlink to /gl,
> ie: /gl -> /share/gl
>
> I've been using gluster with static mounts on a cluster and have never
> seen this behavior; google does not seem to record anyone else seeing
> this with gluster. However, I note that the "Howto Automount GlusterFS"
> page at
> http://www.gluster.org/community/documentation/index.php/Howto_Automount_GlusterFS
> has been deleted. Is automounting no longer supported?
>
> His auto.master file is as follows (sorry for the wrapping):
>
> w1        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.2:/&
> w2        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.3:/&
> mathbio   -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.2:/&
> tw        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.4:/&
> shwstore  -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  shwraid.biomol.uci.edu:/&
> djtstore  -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid.biomol.uci.edu:/&
> djtstore2 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid2.biomol.uci.edu:/djtraid2:/&
> djtstore3 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid3.biomol.uci.edu:/djtraid3:/&
> kevin     -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.2.255.230:/&
> samlab    -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.2.255.237:/&
> new-data  -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  nas-1-1.ib:/&
> gl        -fstype=glusterfs  bs1:/&
>
> He has never seen this behavior with the other automounted fs's. The
> system logs from the affected nodes do not have any gluster strings
> that appear to be relevant, but /var/log/glusterfs/share-gl.log ends
> with this series of odd lines:
>
> [2012-06-18 08:57:38.964243] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
> [2012-06-18 08:57:38.964507] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.16
> [2012-06-18 09:16:48.692701] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693030] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693165] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693394] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 10:56:32.756551] I [fuse-bridge.c:4037:fuse_thread_proc] 0-fuse: unmounting /share/gl
> [2012-06-18 10:56:32.757148] W [glusterfsd.c:816:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3829ed44bd] (-->/lib64/libpthread.so.0 [0x382aa0673d] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0x17c) [0x40524c]))) 0-: received signum (15), shutting down
>
> Any hints as to why this is happening?
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
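PS - for comparison with the automount map above: the static gluster
mounts that have been trouble-free on my own cluster are plain fstab
entries using the native client, something like this (a minimal sketch -
the mount point here is illustrative, not the real one):

# /etc/fstab - native glusterfs mount, no automounter involved
bs1:/gl    /gl    glusterfs    defaults,_netdev    0 0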