Was there anything in dmesg on the servers?

If you are able to reproduce the hang, can you get the output of 'gluster volume status <name> callpool' and 'gluster volume status <name> nfs callpool'?

How big is the 'log/secure' file? Is it so large that the client was simply busy writing it for a very long time?

Are there any signs of disconnections or ping timeouts in the logs?

Avati

On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <sean at gcnpublishing.com> wrote:

> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has gone into
> this product, and I just can't believe that with all the effort behind it
> and so many people using it, it could be so fragile.
>
> So here goes. Perhaps someone here can point out the error of my ways. I
> really want this to work because it would be ideal for our environment, but
> ...
>
> Please note that all of the nodes below are OpenVZ nodes with
> nfs/nfsd/fuse modules loaded on the hosts.
>
> After spending months trying to get 3.2.5 and 3.2.6 working in a
> production environment, I gave up on Gluster and went with a Linux-HA/NFS
> cluster, which just works. The problems I had with gluster were strange
> lock-ups, split brains, and too many instances where the whole cluster was
> off-line until I reloaded the data.
>
> So with the release of 3.3, I decided to give it another try. I created
> one replicated volume on my two NFS servers.
>
> I then mounted the volume on a client as follows:
> 10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
>
> I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test).
>
> Within 10 seconds it locked up solid. No error messages on any of the
> servers, the client was unresponsive, and load on the client was 15+. I
> restarted glusterd on both of my NFS servers, and the client remained
> locked. Finally I killed the cpio process on the client. When I started
> another cpio, it ran further than before, but now the logs on my
> NFS/Gluster server say:
>
> [2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
> 0-pub2-replicate-0: No sources for dir of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry
> self-heal, continuing with the rest of the self-heals
> [2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done]
> 0-pub2-replicate-0: split brain found, aborting selfheal of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
> [2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
> 0-pub2-replicate-0: background data gfid self-heal failed on
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>
> This still seems to be an INCREDIBLY fragile system. Why would it lock up
> solid while copying a large file? Why no errors in the logs?
>
> Am I the only one seeing this kind of behavior?
>
> sean
>
> --
> Sean Fulton
> GCN Publishing, Inc.
> Internet Design, Development and Consulting For Today's Media Companies
> http://www.gcnpublishing.com
> (203) 665-6211, x203
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
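
For reference, a rough sketch of the checks asked for above. This assumes the volume is named 'pub2' (as in the fstab line quoted in the original message), that the gluster logs are in the default /var/log/glusterfs location, and the brick path is a placeholder to be replaced with the real path on your servers:

    # kernel-level messages on each server
    dmesg | tail -n 50

    # pending calls on the bricks and on the gluster NFS server
    gluster volume status pub2 callpool
    gluster volume status pub2 nfs callpool

    # size of the file named in the self-heal messages (substitute your actual brick path)
    ls -lh /path/to/brick/log/secure

    # disconnects or ping timeouts in the server and NFS logs
    grep -iE 'disconnect|ping timeout' /var/log/glusterfs/*.log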