Not real confident in 3.3

sean at gcnpublishing.com (Sean Fulton) · Sat, 16 Jun 2012 13:48:55 -0400

I do not mean to be argumentative, but I have to admit a little 
frustration with Gluster. I know an enormous amount of effort has gone 
into this product, and I just can't believe that with all the effort 
behind it and so many people using it, it could be so fragile.

So here goes. Perhaps someone here can point to the error of my ways. I 
really want this to work because it would be ideal for our environment, 
but ...

Please note that all of the nodes below are OpenVZ nodes with 
nfs/nfsd/fuse modules loaded on the hosts.

After spending months trying to get 3.2.5 and 3.2.6 working in a 
production environment, I gave up on Gluster and went with a 
Linux-HA/NFS cluster which just works. The problems I had with gluster 
were strange lock-ups, split brains, and too many instances where the 
whole cluster was off-line until I reloaded the data.

So with the release of 3.3, I decided to give it another try. I created 
one replicated volume on my two NFS servers.

I then mounted the volume on a client as follows:
10.10.10.7:/pub2    /pub2     nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0

I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test)

Within 10 seconds it locked up solid. No error messages on any of the 
servers, the client was unresponsive and load on the client was 15+. I 
restarted glusterd on both of my NFS servers, and the client remained 
locked. Finally I killed the cpio process on the client. When I started 
another cpio, it ran further than before, but now the logs on my 
NFS/Gluster server say:

[2012-06-16 13:37:35.242754] I 
[afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 
0-pub2-replicate-0: No sources for dir of 
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry 
self-heal, continuing with the rest of the self-heals
[2012-06-16 13:37:35.243315] I 
[afr-self-heal-common.c:994:afr_sh_missing_entries_done] 
0-pub2-replicate-0: split brain found, aborting selfheal of 
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
[2012-06-16 13:37:35.243350] E 
[afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 
0-pub2-replicate-0: background  data gfid self-heal failed on 
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
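
For anyone trying to reproduce this, here is a rough sketch of how the split-brain entries could be pulled out of a log like the one above. It assumes each entry sits on a single line in the actual log file (the wrapping above comes from the mail client) and that the layout is exactly as shown: timestamp, level letter, source location, translator name, message. The function names parse_line and split_brain_paths are made up for illustration; they are not part of Gluster.

```python
import re
from collections import Counter

# Layout assumed from the entries quoted above:
# [timestamp] LEVEL [file:line:function] translator: message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^\]]+)\]\s+(?P<xlator>\S+):\s+(?P<msg>.*)$"
)
# <gfid:UUID> followed by an optional relative path such as /log/secure
GFID_RE = re.compile(r"<gfid:(?P<gfid>[0-9a-f-]{36})>(?P<path>[^\s,]*)")


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    g = GFID_RE.search(rec["msg"])
    rec["gfid"] = g.group("gfid") if g else None
    rec["path"] = g.group("path") if g else None
    return rec


def split_brain_paths(lines):
    """Count entries whose message mentions 'split brain', keyed by (gfid, path)."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and "split brain" in rec["msg"]:
            counts[(rec["gfid"], rec["path"])] += 1
    return counts
```

Feeding it the lines of the server-side log (for example via open(path) on whichever log file holds these entries) would give a count per affected gfid/path, which at least shows whether the split brain is limited to one file or spread across the tree.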

This still seems to be an INCREDIBLY fragile system. Why would it lock 
solid while copying a large file? Why no errors in the logs?

Am I the only one seeing this kind of behavior?

sean

-- 
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203