One client log file is here: http://goo.gl/FyYfy

On the server side, on bs1 & bs4, there is a huge, current nfs.log file
(odd, since I neither wanted nor configured an NFS export). It is filled
entirely with these lines:

tail -5 nfs.log
[2012-06-19 21:11:54.402567] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-1: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.406023] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-2: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.409486] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-3: tcp connect to  failed (Connection refused)
[2012-06-19 21:11:54.412822] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-6: tcp connect to 10.2.7.11:24008 failed (Connection refused)
[2012-06-19 21:11:54.416231] E [rdma.c:4458:tcp_connect_finish] 0-gl-client-7: tcp connect to 10.2.7.11:24008 failed (Connection refused)

On servers bs2 & bs3 there is a current, huge log of this line, repeating
every 3s:

[2012-06-19 21:14:00.907387] I [socket.c:1798:socket_event_handler] 0-transport: disconnecting now

I was reminded as I was copying it that the client and servers are at
slightly different versions - the client is "3.3.0qa42-1" while the
servers are "3.3.0-1". Is this enough version skew to cause problems?
There are no other problems that I'm aware of, but if even a slight
version skew is problematic, I'll be careful to keep them exactly
aligned. I think this was done because the final release binary did not
support the glibc that we were using on the compute nodes, while the
3.3.0qa42-1 build did. Perhaps too sloppy...?
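For what it's worth, the skew is easy to spot-check from a single node -
a minimal sketch, assuming passwordless ssh to the servers ('glusterfs
--version' prints the version string on both clients and servers):

for h in bs1 bs2 bs3 bs4; do
    echo -n "$h: "; ssh $h 'glusterfs --version | head -1'
done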
gluster volume info

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

gluster volume status

Status of volume: gl
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick bs2:/raid1                               24009   Y       2908
Brick bs2:/raid2                               24011   Y       2914
Brick bs3:/raid1                               24009   Y       2860
Brick bs3:/raid2                               24011   Y       2866
Brick bs4:/raid1                               24009   Y       2992
Brick bs4:/raid2                               24011   Y       2998
Brick bs1:/raid1                               24013   Y       10122
Brick bs1:/raid2                               24015   Y       10154
NFS Server on localhost                        38467   Y       9475
NFS Server on 10.2.7.11                        38467   Y       10160
NFS Server on bs2                              38467   N       N/A
NFS Server on bs3                              38467   N       N/A

Hmm - sure enough, bs1 and bs4 (localhost in the above info) appear to
be running NFS servers, while bs2 & bs3 are not...?

OK - after some googling, the gluster NFS service can be shut off with:

gluster volume set gl nfs.disable on
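If I understand the CLI correctly, the change can be confirmed with
'gluster volume info' (reconfigured options are listed at the bottom)
and reverted later if need be:

gluster volume info gl | grep nfs
# should now report: nfs.disable: on
gluster volume set gl nfs.disable off    # turns the gluster NFS server back on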
And now the status looks like this:

gluster volume status

Status of volume: gl
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick bs2:/raid1                               24009   Y       2908
Brick bs2:/raid2                               24011   Y       2914
Brick bs3:/raid1                               24009   Y       2860
Brick bs3:/raid2                               24011   Y       2866
Brick bs4:/raid1                               24009   Y       2992
Brick bs4:/raid2                               24011   Y       2998
Brick bs1:/raid1                               24013   Y       10122
Brick bs1:/raid2                               24015   Y       10154

hjm

On Tue, 2012-06-19 at 13:05 -0700, Anand Avati wrote:
> Can you post the complete logs? Is the 'Too many levels of symbolic
> links' (or ELOOP) logs seen in the client log or brick logs?
>
> Avati
>
> On Tue, Jun 19, 2012 at 11:22 AM, harry mangalam
> <hjmangalam at gmail.com> wrote:
>
> (Apologies if this already posted, but I recently had to change smtp
> servers, which scrambled some list permissions, and I haven't seen it
> post.)
>
> I set up a 3.3 gluster volume for another sysadmin and he has added it
> to his cluster via automount. It seems to work initially, but after
> some time (days) he is now regularly seeing this warning:
>
> "Too many levels of symbolic links"
>
> when he tries to traverse the mounted filesystems.
>
> $ df: `/share/gl': Too many levels of symbolic links
>
> It's supposed to be mounted on /share/gl with a symlink to /gl,
> ie: /gl -> /share/gl
>
> I've been using gluster with static mounts on a cluster and have never
> seen this behavior; google does not seem to record anyone else seeing
> this with gluster. However, I note that the "Howto Automount GlusterFS"
> page at
> http://www.gluster.org/community/documentation/index.php/Howto_Automount_GlusterFS
> has been deleted. Is automounting no longer supported?
>
> His auto.master file is as follows (sorry for the wrapping):
>
> w1        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.2:/&
> w2        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.3:/&
> mathbio   -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.2:/&
> tw        -rw,intr,bg,v3,rsize=16384,wsize=16384,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.1.50.4:/&
> shwstore  -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  shwraid.biomol.uci.edu:/&
> djtstore  -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid.biomol.uci.edu:/&
> djtstore2 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid2.biomol.uci.edu:/djtraid2:/&
> djtstore3 -rw,intr,bg,v3,rsize=16384,wsize=16384,lock,defaults,noatime,async  djtraid3.biomol.uci.edu:/djtraid3:/&
> kevin     -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.2.255.230:/&
> samlab    -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  10.2.255.237:/&
> new-data  -rw,intr,bg,rsize=65520,wsize=65520,retrans=10,timeo=20,hard,lock,defaults,noatime,async  nas-1-1.ib:/&
> gl        -fstype=glusterfs  bs1:/&
>
> He has never seen this behavior with the other automounted fs's. The
> system logs from the affected nodes do not have any gluster strings
> that appear to be relevant, but /var/log/glusterfs/share-gl.log ends
> with this series of odd lines:
>
> [2012-06-18 08:57:38.964243] I [client-handshake.c:453:client_set_lk_version_cbk] 0-gl-client-6: Server lk version = 1
> [2012-06-18 08:57:38.964507] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.16
> [2012-06-18 09:16:48.692701] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693030] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693165] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 09:16:48.693394] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-4: remote operation failed: Stale NFS file handle. Path: /tdlong/RILseq/makebam.commands (90193380-d107-4b6c-b02f-ab53a0f65148)
> [2012-06-18 10:56:32.756551] I [fuse-bridge.c:4037:fuse_thread_proc] 0-fuse: unmounting /share/gl
> [2012-06-18 10:56:32.757148] W [glusterfsd.c:816:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3829ed44bd] (-->/lib64/libpthread.so.0 [0x382aa0673d] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0x17c) [0x40524c]))) 0-: received signum (15), shutting down
>
> Any hints as to why this is happening?
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
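PS - for comparison with the automount map above: the static gluster
mounts that have been trouble-free on my own cluster are plain fstab
entries using the native client, something like this (a minimal sketch -
the mount point here is illustrative, not the real one):

# /etc/fstab - native glusterfs mount, no automounter involved
bs1:/gl    /gl    glusterfs    defaults,_netdev    0 0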