Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did, or something that
can be easily fixed.

Yes, there was an error on the client server:

[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954 2 0 0x00000000
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90
[586898.273365] [<ffffffff811bbd6c>] ? writeback_sb_inodes+0x13c/0x210
[586898.273370] [<ffffffff811bab28>] inode_wait_for_writeback+0x98/0xc0
[586898.273377] [<ffffffff81095550>] ? wake_bit_function+0x0/0x50
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420
[586898.273389] [<ffffffff814e637e>] ? thread_return+0x4e/0x7d0
[586898.273394] [<ffffffff811bc5a9>] wb_do_writeback+0x1a9/0x250
[586898.273402] [<ffffffff8107e2e0>] ? process_timeout+0x0/0x10
[586898.273407] [<ffffffff811bc6b3>] bdi_writeback_task+0x63/0x1b0
[586898.273412] [<ffffffff810953e7>] ? bit_waitqueue+0x17/0xc0
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20
[root@server-10 ~]#

Here are the file sizes. secure was big, but the client was hung for
quite a long time:

-rw------- 1 root root         0 Dec 20 10:17 boot.log
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp
-rw------- 1 root root    337661 Jun 16 16:36 cron
-rw-r--r-- 1 root root         0 Jun  9 18:33 dmesg
-rw-r--r-- 1 root root         0 Jun  9 16:19 dmesg.old
-rw-r--r-- 1 root root     98585 Dec 21 14:32 dracut.log
drwxr-xr-x 5 root root      4096 Dec 21 16:53 glusterfs
drwx------ 2 root root      4096 Mar  1 16:11 httpd
-rw-r--r-- 1 root root    146000 Jun 16 13:36 lastlog
drwxr-xr-x 2 root root      4096 Dec 20 10:35 mail
-rw------- 1 root root   1072902 Jun  9 18:33 maillog
-rw------- 1 root root     50638 Jun 16 12:13 messages
drwxr-xr-x 2 root root      4096 Dec 30 16:14 nginx
drwx------ 3 root root      4096 Dec 20 10:35 samba
-rw------- 1 root root 222214339 Jun 16 13:37 secure
-rw------- 1 root root         0 Sep 13  2011 spooler
-rw------- 1 root root         0 Sep 13  2011 tallylog
-rw-rw-r-- 1 root utmp    114432 Jun 16 13:37 wtmp
-rw------- 1 root root      7015 Jun 16 12:13 yum.log
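If it hangs again I will also grab the state Avati asks for below. A
rough capture sequence to run on each server while the client is
wedged (the volume name pub2 comes from my fstab below; the sysrq
line is an assumption that the hosts have magic-sysrq enabled):

    VOL=pub2    # our volume name, from the fstab entry below

    # Anything from the kernel on the servers?
    dmesg | tail -n 200 > /tmp/dmesg.$(hostname).txt

    # Call-pool state for the bricks and for the Gluster NFS server
    gluster volume status $VOL callpool     > /tmp/callpool.$(hostname).txt
    gluster volume status $VOL nfs callpool > /tmp/nfs-callpool.$(hostname).txt

    # Ask the kernel to log stacks of all blocked (D-state) tasks;
    # the output lands in dmesg / /var/log/messages
    echo w > /proc/sysrq-trigger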
On 06/16/2012 05:04 PM, Anand Avati wrote:
> Was there anything in dmesg on the servers? If you are able to
> reproduce the hang, can you get the output of 'gluster volume status
> <name> callpool' and 'gluster volume status <name> nfs callpool'?
>
> How big is the 'log/secure' file? Is it so large that the client was
> just busy writing it for a very long time? Are there any signs of
> disconnections or ping timeouts in the logs?
>
> Avati
>
> On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton
> <sean@gcnpublishing.com> wrote:
>
>     I do not mean to be argumentative, but I have to admit a little
>     frustration with Gluster. I know an enormous amount of effort
>     has gone into this product, and I just can't believe that with
>     all the effort behind it, and so many people using it, it could
>     be so fragile.
>
>     So here goes. Perhaps someone here can point out the error of my
>     ways. I really want this to work because it would be ideal for
>     our environment, but ...
>
>     Please note that all of the nodes below are OpenVZ nodes with
>     the nfs/nfsd/fuse modules loaded on the hosts.
>
>     After spending months trying to get 3.2.5 and 3.2.6 working in a
>     production environment, I gave up on Gluster and went with a
>     Linux-HA/NFS cluster, which just works. The problems I had with
>     Gluster were strange lock-ups, split-brains, and too many
>     instances where the whole cluster was off-line until I reloaded
>     the data.
>
>     So with the release of 3.3, I decided to give it another try. I
>     created one replicated volume on my two NFS servers.
>
>     I then mounted the volume on a client as follows:
>
>     10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
>
>     I threw some data at it:
>
>     find / -mount -print | cpio -pvdum /pub2/test
>
>     Within 10 seconds it locked up solid. There were no error
>     messages on any of the servers, the client was unresponsive, and
>     the load on the client was 15+. I restarted glusterd on both of
>     my NFS servers, and the client remained locked. Finally I killed
>     the cpio process on the client. When I started another cpio, it
>     ran further than before, but now the logs on my NFS/Gluster
>     server say:
>
>     [2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 0-pub2-replicate-0: No sources for dir of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry self-heal, continuing with the rest of the self-heals
>     [2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split brain found, aborting selfheal of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>     [2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0: background data gfid self-heal failed on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>
>     This still seems to be an INCREDIBLY fragile system. Why would
>     it lock up solid while copying a large file? Why are there no
>     errors in the logs?
>
>     Am I the only one seeing this kind of behavior?
>
>     sean
>
>     --
>     Sean Fulton
>     GCN Publishing, Inc.
>     Internet Design, Development and Consulting For Today's Media Companies
>     http://www.gcnpublishing.com
>     (203) 665-6211, x203

--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
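P.S. For anyone following along, here is how I intend to dig into the
reported split-brain on log/secure, pieced together from the 3.3 heal
commands; treat it as a sketch, not gospel (the brick path
/export/pub2 is made up, substitute your own):

    # On either server: which entries does Gluster itself consider
    # split-brained, and which self-heals failed outright?
    gluster volume heal pub2 info split-brain
    gluster volume heal pub2 info heal-failed

    # Compare the AFR changelog xattrs of the file on each brick;
    # non-zero pending counters on both replicas, each blaming the
    # other, is the classic split-brain signature.
    getfattr -d -m trusted.afr -e hex /export/pub2/log/secure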