Hi,

I have a working glusterfs setup running on CentOS 5.3 with:

  glusterfs-2.0.4 (compiled from the source RPM)
  fuse-2.7.4-1
  dkms-fuse-2.7.4-1.rf
  autofs-5.0.1-0.rc2.102
  kernel 2.6.18-128.1.10.el5

and this all works just fine - autofs mounts the file system as you would expect, and this has been in production for some time.

However, if I try to upgrade any of the components it breaks, in that the autofs mount hangs rather than completing. Mounting the file system by hand with an explicit mount command always works correctly.

I've tried several versions of glusterfs later than the above, including the latest 3.0.2-1, with exactly the same result. Additionally, keeping that version of gluster and updating any of the other components also seems to break it, although I've not been able to test all the combinations - certainly the following set doesn't work either:

  glusterfs-3.0.2-1
  dkms-fuse-2.7.4-1.nodist.rf
  fuse-2.7.4-8.el5
  autofs-5.0.1-0.rc2.131.el5_4.1
  kernel 2.6.18-164.11.1.el5

I wonder if someone on the list can help me, as I've seen nothing in bugzilla relating to this.
Relevant information follows (for my test rig only).

Server volfile is:

---snip---
[l3admin at oy-centos-5_3-buildserver glusterfs]$ cat /etc/glusterfs/glusterfsd.vol
## Export volume "images-brick" with the contents of /export/images directory
volume posix
  type storage/posix
  option directory /export/shared/
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes locks
  option auth.addr.locks.allow *
end-volume
---snip---

Client volfile:

---snip---
[l3admin at oy-centos-5_3-buildserver glusterfs]$ cat /etc/glusterfs/glusterfs.vol
volume oy-centos-5_3-buildserver
  type protocol/client
  option transport-type tcp/client
  option remote-host 127.0.0.1
  option remote-subvolume locks
end-volume
---snip---

/etc/auto.master has the following:

---snip---
/mnt/auto /etc/auto.d/auto.gluster --timeout=60 --ghost
---snip---

and auto.gluster has:

---snip---
# Mount the glustered file system
shared -fstype=glusterfs :/etc/glusterfs/glusterfs.vol
---snip---

Mounting the gluster file system directly works fine:

---snip---
[l3admin at oy-centos-5_3-buildserver ~]$ sudo mount -t glusterfs /etc/glusterfs/glusterfs.vol /mnt/auto/shared/
[l3admin at oy-centos-5_3-buildserver ~]$ df /mnt/auto/shared
Filesystem           1K-blocks      Used Available Use% Mounted on
glusterfs#/etc/glusterfs/glusterfs.vol
                       2031360    543744   1382656  29% /mnt/auto/shared
---snip---

Starting autofs and then attempting to access the mounted directory (e.g. ls /mnt/auto/shared/) causes glusterfs to hang, leaving a process list like this:

[l3admin at oy-centos-5_3-buildserver glusterfs]$ pstree -p | grep glu
 |-automount(22819)-+-mount(22830)---mount.glusterfs(22831)---glusterfs(22880)---glusterfs(22881)
 |-glusterfs(22882)---{glusterfs}(22883)
 |-glusterfsd(22356)---{glusterfsd}(22357)

Running gdb against 22882 during the hang shows:

(gdb) bt
#0  0x00c81402 in __kernel_vsyscall ()
#1  0x00ee7473 in __xstat64@GLIBC_2.1 () from /lib/libc.so.6
#2  0x00df66ec in stat64 () from
/usr/lib/glusterfs/3.0.0/xlator/mount/fuse.so
#3  0x00df419c in init (this_xl=0x88db1f8) at fuse-bridge.c:3368
#4  0x00b2293d in xlator_init (xl=0x88db1f8) at xlator.c:940
#5  0x00b22583 in xlator_init_rec (xl=0x88db1f8) at xlator.c:833
#6  0x00b226e6 in xlator_tree_init (xl=0x88db1f8) at xlator.c:871
#7  0x0804b299 in _xlator_graph_init ()
#8  0x0804b433 in glusterfs_graph_init ()
#9  0x0804d40c in main ()

(gdb) directory /home/l3admin/rpmbuild/BUILD/glusterfs-3.0.0/glusterfsd/src
Source directories searched: /home/l3admin/rpmbuild/BUILD/glusterfs-3.0.0/glusterfsd/src:$cdir:$cwd
(gdb) list *0x00df419c
0xdf419c is in init (fuse-bridge.c:3368).
3363                    gf_log ("fuse", GF_LOG_ERROR,
3364                            "Mandatory option 'mountpoint' is not specified.");
3365                    goto cleanup_exit;
3366            }
3367
3368            if (stat (value_string, &stbuf) != 0) {
3369                    if (errno == ENOENT) {
3370                            gf_log (this_xl->name, GF_LOG_ERROR,
3371                                    "%s %s does not exist",
3372                                    ZR_MOUNTPOINT_OPT, value_string);
(gdb) select-frame 3
(gdb) print value_string
$1 = 0x88da198 "/mnt/auto/shared"

By the time I'd got here, the spawned process with pid 22883 had died (is there a watchdog of some sort?), so I repeated the exercise and ran gdb on the watchdog process (which I think was pid 22881), getting this:

(gdb) bt
#0  0x0013d402 in __kernel_vsyscall ()
#1  0x001cf996 in nanosleep () from /lib/libc.so.6
#2  0x0020915c in usleep () from /lib/libc.so.6
#3  0x00d7fa2d in gf_timer_proc (ctx=0x8560008) at timer.c:177
#4  0x0068573b in start_thread () from /lib/libpthread.so.0
#5  0x0020fcfe in clone () from /lib/libc.so.6

And so I presume that this process is waiting for some communication from the process which spawned it, indicating that the mount is complete?

Regards to all

Phil

-- 
Director, Layer3 Systems Ltd
Layer3 Systems Limited is registered in England. Company no 3130393
43 Pendle Road, Streatham, London, SW16 6RT
tel: 020 8769 4484