On 09/18/2009 03:37 PM, Mark Mielke wrote:
> On 09/18/2009 02:28 PM, Anand Avati wrote:
>>> For me, it does not clear after 3 mins or 3 hours. I restarted the
>>> machines at midnight, and the first time I tried again was around 1pm
>>> the next day (13 hours). I easily recognize the symptoms, as the
>>> /bin/mount remains in the process tree. I can't get a strace -p on the
>>> /bin/mount process since it is frozen. The glusterfsd process is not
>>> frozen - the glusterfs process seems to be waiting on /bin/mount to
>>> complete. The only way to unfreeze the mount seems to be to kill -9
>>> /bin/mount (a regular kill does not work), at which point the mount
>>> point goes into the disconnected state, and it is recovered using
>>> unmount / remount. I tried to track down the problem before, but became
>>> confused, because glusterfs seems to do its own FUSE mount management
>>> rather than using the standard (for Linux, anyway?) FUSE user space
>>> libraries. If my memory is correct, the process seems to be: I run
>>> mount, the mount runs /sbin/mount.glusterfs, which runs glusterfs,
>>> which runs /bin/mount with the full options?
>>
>> This looks like a different issue from what I previously described. If
>> you are certain that the /bin/mount which was hung was the one which
>> glusterfs had spawned, then the issue might be something else. The way
>> fuse based filesystems mount is two-fold. The first 'mount -t glusterfs'
>> starts /bin/mount, which in turn calls /sbin/mount.glusterfs. This
>> starts the glusterfs binary, which, at the time of initializing the
>> fuse xlator, results in a call to fuse_mount() of libfuse. libfuse in
>> turn does the second phase of mounting by calling mount -t fuse and in
>> turn /sbin/mount.fuse. I'm trying to think how the three machines
>> rebooting together could cause the second-phase fuse mount to hang.
>
> Thanks for looking at this. The above is compatible with my thinking.
> I'll see about getting output to prove it.

[root@wcarh033]~# ps -ef | grep gluster
root      1548     1  0 21:00 ?        00:00:00 /opt/glusterfs/sbin/glusterfsd -f /etc/glusterfs/glusterfsd.vol
root      1861     1  0 21:00 ?        00:00:00 /opt/glusterfs/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/tools.vol /gluster/tools
root      1874  1861  0 21:00 ?        00:00:00 /bin/mount -i -f -t fuse.glusterfs -o rw,allow_other,default_permissions,max_read=131072 /etc/glusterfs/tools.vol /gluster/tools
root      2426  2395  0 21:02 pts/2    00:00:00 grep gluster
[root@wcarh033]~# ls /gluster/tools
^C^C

Yep - all three nodes locked up. All it took was a simultaneous reboot
of all three machines.

After I kill -9 1874 (kill 1874 without -9 has no effect) from a
different ssh session, I get:

ls: cannot access /gluster/tools: Transport endpoint is not connected

After this, mount works (it turns out an unmount is not necessary).

I am unable to strace -p the mount -t fuse process without it freezing
up. I can pstack it, but it returns 0 lines of output fairly quickly.

The symptoms are identical on all three machines. The configuration is
3-way replication: each machine runs both a server exposing one volume
and a client, using cluster/replication with a preferred read of the
local server.

Cheers,
mark

--
Mark Mielke <mark at mielke.cc>
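
For reference, the two-phase mount described in the quoted explanation can be summarized as a call chain. The paths and PIDs are taken from the ps output above; the /sbin/mount.glusterfs step comes from the quoted explanation and is not itself visible in that output, so treat the exact intermediate steps as an assumption rather than a verified trace:

    mount -t glusterfs /etc/glusterfs/tools.vol /gluster/tools          (phase one)
      -> /bin/mount
           -> /sbin/mount.glusterfs
                -> /opt/glusterfs/sbin/glusterfs --volfile=/etc/glusterfs/tools.vol /gluster/tools   (pid 1861)
                     -> fuse xlator init / fuse_mount()                  (phase two)
                          -> /bin/mount -i -f -t fuse.glusterfs -o ... /etc/glusterfs/tools.vol /gluster/tools   (pid 1874, the process that hangs)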
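
A minimal sketch of the recovery sequence described above, using PID 1874 (the hung /bin/mount helper) from the ps output. The final mount invocation is an assumption based on the volfile and mount point shown in that output; it is not a command quoted from this thread:

    # The hung second-phase mount helper (pid 1874 above) ignores a plain
    # SIGTERM; only SIGKILL removes it.
    kill -9 1874

    # The mount point now fails with "Transport endpoint is not connected"
    # instead of hanging.
    ls /gluster/tools

    # Remounting directly restores access; unmounting first turned out to be
    # unnecessary. (Assumed invocation, built from the ps output above.)
    mount -t glusterfs /etc/glusterfs/tools.vol /gluster/tools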