Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did, or something that
can be easily fixed.

Yes, there was an error on the client server:

[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954 2 0 0x00000000
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90
[586898.273365] [<ffffffff811bbd6c>] ? writeback_sb_inodes+0x13c/0x210
[586898.273370] [<ffffffff811bab28>] inode_wait_for_writeback+0x98/0xc0
[586898.273377] [<ffffffff81095550>] ? wake_bit_function+0x0/0x50
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420
[586898.273389] [<ffffffff814e637e>] ? thread_return+0x4e/0x7d0
[586898.273394] [<ffffffff811bc5a9>] wb_do_writeback+0x1a9/0x250
[586898.273402] [<ffffffff8107e2e0>] ? process_timeout+0x0/0x10
[586898.273407] [<ffffffff811bc6b3>] bdi_writeback_task+0x63/0x1b0
[586898.273412] [<ffffffff810953e7>] ? bit_waitqueue+0x17/0xc0
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20
[root@server-10 ~]#

Here are the file sizes. secure was big, but the client was hung for
quite a long time:

-rw------- 1 root root         0 Dec 20 10:17 boot.log
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp
-rw------- 1 root root    337661 Jun 16 16:36 cron
-rw-r--r-- 1 root root         0 Jun  9 18:33 dmesg
-rw-r--r-- 1 root root         0 Jun  9 16:19 dmesg.old
-rw-r--r-- 1 root root     98585 Dec 21 14:32 dracut.log
drwxr-xr-x 5 root root      4096 Dec 21 16:53 glusterfs
drwx------ 2 root root      4096 Mar  1 16:11 httpd
-rw-r--r-- 1 root root    146000 Jun 16 13:36 lastlog
drwxr-xr-x 2 root root      4096 Dec 20 10:35 mail
-rw------- 1 root root   1072902 Jun  9 18:33 maillog
-rw------- 1 root root     50638 Jun 16 12:13 messages
drwxr-xr-x 2 root root      4096 Dec 30 16:14 nginx
drwx------ 3 root root      4096 Dec 20 10:35 samba
-rw------- 1 root root 222214339 Jun 16 13:37 secure
-rw------- 1 root root         0 Sep 13  2011 spooler
-rw------- 1 root root         0 Sep 13  2011 tallylog
-rw-rw-r-- 1 root utmp    114432 Jun 16 13:37 wtmp
-rw------- 1 root root      7015 Jun 16 12:13 yum.log
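If it hangs again I will also grab the state Avati asks for below. A
rough capture sequence to run on each server while the client is
wedged (the volume name pub2 comes from my fstab below; the sysrq
line is an assumption that the hosts have magic-sysrq enabled):

    VOL=pub2    # our volume name, from the fstab entry below

    # Anything from the kernel on the servers?
    dmesg | tail -n 200 > /tmp/dmesg.$(hostname).txt

    # Call-pool state for the bricks and for the Gluster NFS server
    gluster volume status $VOL callpool     > /tmp/callpool.$(hostname).txt
    gluster volume status $VOL nfs callpool > /tmp/nfs-callpool.$(hostname).txt

    # Ask the kernel to log stacks of all blocked (D-state) tasks;
    # the output lands in dmesg / /var/log/messages
    echo w > /proc/sysrq-trigger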
On 06/16/2012 05:04 PM, Anand Avati wrote:
> Was there anything in dmesg on the servers? If you are able to
> reproduce the hang, can you get the output of 'gluster volume status
> <name> callpool' and 'gluster volume status <name> nfs callpool'?
>
> How big is the 'log/secure' file? Is it so large that the client was
> just busy writing it for a very long time? Are there any signs of
> disconnections or ping timeouts in the logs?
>
> Avati
>
> On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton
> <sean@gcnpublishing.com> wrote:
>
>     I do not mean to be argumentative, but I have to admit a little
>     frustration with Gluster. I know an enormous amount of effort
>     has gone into this product, and I just can't believe that with
>     all the effort behind it, and so many people using it, it could
>     be so fragile.
>
>     So here goes. Perhaps someone here can point out the error of my
>     ways. I really want this to work because it would be ideal for
>     our environment, but ...
>
>     Please note that all of the nodes below are OpenVZ nodes with
>     the nfs/nfsd/fuse modules loaded on the hosts.
>
>     After spending months trying to get 3.2.5 and 3.2.6 working in a
>     production environment, I gave up on Gluster and went with a
>     Linux-HA/NFS cluster, which just works. The problems I had with
>     Gluster were strange lock-ups, split-brains, and too many
>     instances where the whole cluster was off-line until I reloaded
>     the data.
>
>     So with the release of 3.3, I decided to give it another try. I
>     created one replicated volume on my two NFS servers.
>
>     I then mounted the volume on a client as follows:
>
>     10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
>
>     I threw some data at it:
>
>     find / -mount -print | cpio -pvdum /pub2/test
>
>     Within 10 seconds it locked up solid. There were no error
>     messages on any of the servers, the client was unresponsive, and
>     the load on the client was 15+. I restarted glusterd on both of
>     my NFS servers, and the client remained locked. Finally I killed
>     the cpio process on the client. When I started another cpio, it
>     ran further than before, but now the logs on my NFS/Gluster
>     server say:
>
>     [2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 0-pub2-replicate-0: No sources for dir of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry self-heal, continuing with the rest of the self-heals
>     [2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split brain found, aborting selfheal of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>     [2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0: background data gfid self-heal failed on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
>
>     This still seems to be an INCREDIBLY fragile system. Why would
>     it lock up solid while copying a large file? Why are there no
>     errors in the logs?
>
>     Am I the only one seeing this kind of behavior?
>
>     sean
>
>     --
>     Sean Fulton
>     GCN Publishing, Inc.
>     Internet Design, Development and Consulting For Today's Media Companies
>     http://www.gcnpublishing.com
>     (203) 665-6211, x203

--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
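P.S. For anyone following along, here is how I intend to dig into the
reported split-brain on log/secure, pieced together from the 3.3 heal
commands; treat it as a sketch, not gospel (the brick path
/export/pub2 is made up, substitute your own):

    # On either server: which entries does Gluster itself consider
    # split-brained, and which self-heals failed outright?
    gluster volume heal pub2 info split-brain
    gluster volume heal pub2 info heal-failed

    # Compare the AFR changelog xattrs of the file on each brick;
    # non-zero pending counters on both replicas, each blaming the
    # other, is the classic split-brain signature.
    getfattr -d -m trusted.afr -e hex /export/pub2/log/secure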