Hi Mark,

This sounds a lot like the fast first access bug (Bug 220) that's been an
issue since at least 2.0.0rc1.

You might want to add your observations into the comments for this bug -
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=220
(The devs seem to prefer handling bugs this way.)

Geoff.

On Mon, 7 Sep 2009, Mark Mielke wrote:

> On 09/06/2009 11:19 PM, Mark Mielke wrote:
> > On 09/06/2009 10:42 PM, Mark Mielke wrote:
> >> This seems to happen about 50% of the time:
> >>
> >> [root@wcarh035 ~]# ls /gluster/data
> >> ls: cannot open directory /gluster/data: No such file or directory
> >> [root@wcarh035 ~]# ls /gluster/data
> >> 00 06.fun 15 23.fun 32 40.fun 47 55.fun 64
> >> 00.fun 07 15.fun 24 32.fun 41 47.fun 56 64.fun
> >>
> >> My current guess is that GlusterFS is saying the mount is complete to
> >> AutoFS before the actual mount operation takes effect. 50% of the
> >> time GlusterFS is able to complete the mount before AutoFS lets the
> >> user continue, and all is well. The other 50% of the time, GlusterFS
> >> does not quite finish the mount, and AutoFS gives the user a broken
> >> directory.
> >>
> >> I might try and prove this by adding a sleep 5 to
> >> /sbin/mount.glusterfs, although I do not consider this a valid
> >> solution, as it just reduces the effect of the race - it does not
> >> eliminate the race.
> >
> > Uhh... Hmm... It already has a "sleep 3", and changing it to "sleep 5"
> > does not reduce the frequency of the problem. Changing it to "sleep
> > 10" also has no effect.
> >
> > Why does it sometimes work and sometimes not?
>
> I note that the fusermount from the FUSE libraries does not seem to have
> the same problem:
>
> $ /stage/linux/fuse-2.7.4/example/fusexmp_fh /tmp/t ; ls /tmp/t
> backup/ boot/ etc/ lib64/ media/ pccyber/ sbin/ stage/ usr/
> backup2/ db/ home/ lost+found/ mnt/ proc/ selinux/ sys/ var/
> bin/ dev/ lib/ mail/ opt/ root/ srv/ tmp/ www/
>
> It works immediately.
> Compare this to:
>
> [root@wcarh033]~# echo hi >/tmp/t/hi
> [root@wcarh033]~# time /opt/glusterfs/sbin/glusterfs --volfile=/etc/glusterfs/gluster-data.vol /tmp/t ; ls /tmp/t ; sleep 1 ; ls /tmp/t
> /opt/glusterfs/sbin/glusterfs --volfile=/etc/glusterfs/gluster-data.vol /tmp/ 0.00s user 0.00s system 113% cpu 0.003 total
> hi
> 00 06.fun 15 23.fun 32 40.fun 47 55.fun 64
> 00.fun 07 15.fun 24 32.fun 41 47.fun 56 64.fun
> 01 07.fun 16 24.fun 33 41.fun 50 56.fun 65
> 01.fun 10 16.fun 25 33.fun 42 50.fun 57 65.fun
> 02 10.fun 17 25.fun 34 42.fun 51 57.fun 66
> 02.fun 11 17.fun 26 34.fun 43 51.fun 60 66.fun
> 03 11.fun 20 26.fun 35 43.fun 52 60.fun 67
> 03.fun 12 20.fun 27 35.fun 44 52.fun 61 67.fun
> 04 12.fun 21 27.fun 36 44.fun 53 61.fun lost+found
> 04.fun 13 21.fun 30 36.fun 45 53.fun 62
> 05 13.fun 22 30.fun 37 45.fun 54 62.fun
> 05.fun 14 22.fun 31 37.fun 46 54.fun 63
> 06 14.fun 23 31.fun 40 46.fun 55 63.fun
>
> Note that the first 'ls' returns 'hi', and a second later, 'ls' returns
> the glusterfs content.
>
> For fusexmp, it appears to complete the mount before it returns. For
> glusterfs, it seems to complete the mount a short time after it completes.
>
> I think this is where autofs is getting confused, and serving the handle
> to the directory to the client too early. It thinks glusterfs is done
> mounting, and gives the handle to the client, but this handle is broken
> and fails. Glusterfs completes the mount, and a short time later the
> lookups succeed. Adding 'sleep' in mount.glusterfs does not seem to be
> good enough - as 'sleep 1' and 'sleep 20' do not change the frequency.
> The existing 'sleep 3' in /sbin/mount.glusterfs should be completely
> unnecessary. Instead, we should figure out why GlusterFS cannot ensure
> the mount is in place before it returns.
>
> I'm worn out investigating for today - hopefully somebody can help me? :-)
>
> Cheers,
> mark
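
For anyone wanting to confirm the timing Mark describes in the manual test above, here is a rough sketch that polls for the mount instead of using a fixed sleep. It is only a diagnostic and not a fix, and since longer sleeps made no difference under autofs it may not help there at all. The names wait_until and is_fuse_mounted are made up for this example. It also assumes the mount shows up in /proc/mounts with a filesystem type starting with "fuse", which may differ on your setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of GlusterFS or autofs).
# Usage: wait_until LIMIT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds or LIMIT_SECONDS pass.
# Returns 0 if COMMAND succeeded, 1 if the limit was reached first.
wait_until() {
    wait_limit=$1
    shift
    wait_elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$wait_elapsed" -ge "$wait_limit" ]; then
            return 1
        fi
        sleep 1
        wait_elapsed=$((wait_elapsed + 1))
    done
    return 0
}

# Assumed probe: succeed if DIR is listed in /proc/mounts with a type that
# starts with "fuse". Paths containing spaces are escaped in /proc/mounts
# and are not handled here.
is_fuse_mounted() {
    awk -v dir="$1" '$2 == dir && $3 ~ /^fuse/ { found = 1 } END { exit !found }' /proc/mounts
}

# Example (commented out so that sourcing this file mounts nothing):
#   /opt/glusterfs/sbin/glusterfs --volfile=/etc/glusterfs/gluster-data.vol /tmp/t
#   wait_until 10 is_fuse_mounted /tmp/t && ls /tmp/t
```

If the 'ls' after wait_until always shows the glusterfs content, that would support the idea that the glusterfs command returns before the mount is in place. It says nothing about whether lookups through autofs succeed on first access.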