Was there anything in dmesg on the servers?

If you are able to reproduce the hang, can you get the output of 'gluster volume status <name> callpool' and 'gluster volume status <name> nfs callpool'?

How big is the 'log/secure' file? Is it so large that the client was simply busy writing it for a very long time?

Are there any signs of disconnections or ping timeouts in the logs?

Avati

On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <sean at gcnpublishing.com> wrote:

> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has gone into
> this product, and I just can't believe that with all the effort behind it
> and so many people using it, it could be so fragile.
>
> So here goes. Perhaps someone here can point out the error of my ways. I
> really want this to work because it would be ideal for our environment, but
> ...
>
> Please note that all of the nodes below are OpenVZ nodes with
> nfs/nfsd/fuse modules loaded on the hosts.
>
> After spending months trying to get 3.2.5 and 3.2.6 working in a
> production environment, I gave up on Gluster and went with a Linux-HA/NFS
> cluster, which just works. The problems I had with gluster were strange
> lock-ups, split brains, and too many instances where the whole cluster was
> off-line until I reloaded the data.
>
> So with the release of 3.3, I decided to give it another try. I created
> one replicated volume on my two NFS servers.
>
> I then mounted the volume on a client as follows:
> 10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
>
> I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test).
>
> Within 10 seconds it locked up solid. No error messages on any of the
> servers, the client was unresponsive, and load on the client was 15+. I
> restarted glusterd on both of my NFS servers, and the client remained
> locked. Finally I killed the cpio process on the client. When I started
> another cpio, it ran further than before, but now the logs on my
> NFS/Gluster server say:
>
> [2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
> 0-pub2-replicate-0: No sources for dir of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry
> self-heal, continuing with the rest of the self-heals
> [2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done]
> 0-pub2-replicate-0: split brain found, aborting selfheal of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
> [2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
> 0-pub2-replicate-0: background data gfid self-heal failed on
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>
> This still seems to be an INCREDIBLY fragile system. Why would it lock up
> solid while copying a large file? Why no errors in the logs?
>
> Am I the only one seeing this kind of behavior?
>
> sean
>
> --
> Sean Fulton
> GCN Publishing, Inc.
> Internet Design, Development and Consulting For Today's Media Companies
> http://www.gcnpublishing.com
> (203) 665-6211, x203
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
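
For reference, a rough sketch of the checks asked for above. This assumes the volume is named 'pub2' (as in the fstab line quoted in the original message), that the gluster logs are in the default /var/log/glusterfs location, and the brick path is a placeholder to be replaced with the real path on your servers:

    # kernel-level messages on each server
    dmesg | tail -n 50

    # pending calls on the bricks and on the gluster NFS server
    gluster volume status pub2 callpool
    gluster volume status pub2 nfs callpool

    # size of the file named in the self-heal messages (substitute your actual brick path)
    ls -lh /path/to/brick/log/secure

    # disconnects or ping timeouts in the server and NFS logs
    grep -iE 'disconnect|ping timeout' /var/log/glusterfs/*.log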