Preliminary tests show that selected backups are working with patch 257. I will test more extensively tonight.

Harris

----- Original Message -----
From: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
To: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx>, "Amar S. Tumballi" <amar@xxxxxxxxxxxxx>
Sent: Monday, July 2, 2007 8:00:33 AM (GMT-0500) America/New_York
Subject: Re: difficult bug in 2.5 mainline

While most backups are now failing with loss of connectivity, one test I ran on a mailbox that backed up successfully under patch 245 now fails late in the process, after copying over 16,000 files from gluster to backup/tmp, when it fails to create a link. I know Zimbra uses hard links to save space. Here is the error message from Zimbra:

ERRORS
harrisl@xxxxxxxxxxxxx: link(/opt/zimbra/backup/sessions/full-20070629.175030.298/accounts/69e/308/69e308b0-0923-4aa7-9036-3bbae4806b2c/blobs/3/4/bZyRbDLjWxuGkqeDEktny,oLmXY=20179-19047.msg1, /opt/zimbra/backup/tmp/full-20070702.111546.524/accounts/69e/308/69e308b0-0923-4aa7-903
java.io.IOException: link(/opt/zimbra/backup/sessions/full-20070629.175030.298/accounts/69e/308/69e308b0-0923-4aa7-9036-3bbae4806b2c/blobs/3/4/bZyRbDLjWxuGkqeDEktny,oLmXY=20179-19047.msg1, /opt/zimbra/backup/tmp/full-20070702.111546.524/accounts/69e/308/69e308b0-0923-4aa7-903
        at com.zimbra.znative.IO.link0(Native Method)
        at com.zimbra.znative.IO.link(IO.java:48)
        at com.zimbra.common.io.AsyncFileCopier$WorkerThread.link(AsyncFileCopier.java:167)
        at com.zimbra.common.io.AsyncFileCopier$WorkerThread.run(AsyncFileCopier.java:130)

Harris

----- Original Message -----
From: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
To: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx>, "Amar S. Tumballi" <amar@xxxxxxxxxxxxx>
Sent: Monday, July 2, 2007 7:04:40 AM (GMT-0500) America/New_York
Subject: Re: difficult bug in 2.5 mainline

Re-ran the backup test with the bricks running under gdb.

Brick1 crash bt:

0xb7e4731d in vasprintf () from /lib/libc.so.6
(gdb) bt
#0  0xb7e4731d in vasprintf () from /lib/libc.so.6
#1  0xb7e2e6be in asprintf () from /lib/libc.so.6
#2  0xb75c11df in server_lookup_cbk () from /usr/lib/glusterfs/1.3.0-pre5/xlator/protocol/server.so
#3  0xb7f5d5be in call_resume (stub=0xb5c03938) at call-stub.c:2697
#4  0xb75d1770 in iot_reply () from /usr/lib/glusterfs/1.3.0-pre5/xlator/performance/io-threads.so
#5  0xb7f2d3db in start_thread () from /lib/libpthread.so.0
#6  0xb7eb726e in clone () from /lib/libc.so.6

Brick2 segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1243444336 (LWP 14631)]
dict_serialized_length (dict=0x80cc448) at dict.c:333
333             len += strlen (pair->key) + 1;
(gdb) bt
#0  dict_serialized_length (dict=0x80cc448) at dict.c:333
#1  0xb7fc748b in gf_block_to_iovec (blk=0x80c6a58, iov=0xb5e28290, cnt=3) at protocol.c:410
#2  0xb762e538 in generic_reply () from /usr/lib/glusterfs/1.3.0-pre5/xlator/protocol/server.so
#3  0xb7631229 in server_lookup_cbk () from /usr/lib/glusterfs/1.3.0-pre5/xlator/protocol/server.so
#4  0xb7fcd5be in call_resume (stub=0x8099b38) at call-stub.c:2697
#5  0xb7641770 in iot_reply () from /usr/lib/glusterfs/1.3.0-pre5/xlator/performance/io-threads.so
#6  0xb7f9d3db in start_thread () from /lib/libpthread.so.0
#7  0xb7f2726e in clone () from /lib/libc.so.6

Harris
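For reference, the faulting line in the brick2 trace (dict.c:333, len += strlen (pair->key) + 1) is consistent with the server trying to serialize a dict that contains a pair whose key is NULL, or a corrupted pair list. A rough sketch of the kind of guard that would avoid the segfault, using simplified stand-in types rather than the real libglusterfs dict_t / data_pair_t:

#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for the real structures; illustrative only. */
typedef struct data_pair {
        char             *key;
        struct data_pair *next;
} data_pair_t;

typedef struct {
        data_pair_t *members_list;
} dict_t;

/* Returns the space needed for all keys (each key plus its NUL separator),
   or -1 if the dict is malformed -- a NULL dict or a pair with a NULL key,
   which is what the brick2 backtrace suggests happened. */
static int
dict_keys_length (dict_t *dict)
{
        int          len  = 0;
        data_pair_t *pair = NULL;

        if (dict == NULL)
                return -1;

        for (pair = dict->members_list; pair != NULL; pair = pair->next) {
                if (pair->key == NULL)
                        return -1;
                len += strlen (pair->key) + 1;
        }
        return len;
}

Whether the real fix belongs in dict_serialized_length or in whatever populated the dict with a NULL key is a separate question; the sketch only shows where the dereference can be caught.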
Tumballi" <amar@xxxxxxxxxxxxx> Sent: Monday, July 2, 2007 6:54:12 AM (GMT-0500) America/New_York Subject: Re: difficult bug in 2.5 mainline Backup test run on patch 252: Zimbra client crashed with BT: #0 0xb7f81f4c in raise () from /lib/libpthread.so.0 (gdb) bt #0 0xb7f81f4c in raise () from /lib/libpthread.so.0 #1 0xb7fac628 in gf_print_trace (signum=6) at common-utils.c:211 #2 <signal handler called> #3 0xb7e7e986 in raise () from /lib/libc.so.6 #4 0xb7e80043 in abort () from /lib/libc.so.6 #5 0xb7e7812d in __assert_fail () from /lib/libc.so.6 #6 0xb7fafe90 in inode_unref (inode=0x80b5fc8) at inode.c:336 #7 0x0804b7a4 in fuse_loc_wipe (fuse_loc=0x847e678) at fuse-bridge.c:97 #8 0x0804b82d in free_state (state=0x847e670) at fuse-bridge.c:129 #9 0x0804efb4 in fuse_entry_cbk (frame=0x84cb380, cookie=0x84ce360, this=0x8058db0, op_ret=8, op_errno=107, inode=0x80b5fc8, buf=0x84b8d90) at fuse-bridge.c:368 #10 0xb7fa9cac in default_lookup_cbk (frame=0x84ce360, cookie=0x84bb7e8, this=0x80587f0, op_ret=8, op_errno=107, inode=0x80b5fc8, buf=0x84b8d90) at defaults.c:40 #11 0xb7fa9cac in default_lookup_cbk (frame=0x84bb7e8, cookie=0x847dc28, this=0x8058760, op_ret=8, op_errno=107, inode=0x80b5fc8, buf=0x84b8d90) at defaults.c:40 #12 0xb75edbbd in unify_sh_opendir_cbk (frame=0x847dc28, cookie=0x8052500, this=0x80579e8, op_ret=8, op_errno=17, fd=0x843ffa0) at unify-self-heal.c:380 #13 0xb75f5f62 in client_opendir_cbk (frame=0x84b9088, args=0x80929b8) at client-protocol.c:3213 #14 0xb75f9077 in notify (this=0x8052a68, event=2, data=0x80902b8) at client-protocol.c:4191 #15 0xb7fada27 in transport_notify (this=0x6e5d, event=6) at transport.c:152 #16 0xb7fae499 in sys_epoll_iteration (ctx=0xbffcfff8) at epoll.c:54 #17 0xb7fadafd in poll_iteration (ctx=0xbffcfff8) at transport.c:260 #18 0x0804a170 in main (argc=6, argv=0xbffd00d4) at glusterfs.c:341 brick2 with namespace crashed as well. brick1 stayed up client2 recovered when brick2 was restarted. No data was written from gluster to backup tmp. Harris ----- Original Message ----- From: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx> To: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx> Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx>, "Amar S. Tumballi" <amar@xxxxxxxxxxxxx> Sent: Sunday, July 1, 2007 10:03:02 PM (GMT-0500) America/New_York Subject: Re: difficult bug in 2.5 mainline The backup hung as first described. No data was written from the secondary volume on gluster to the backup tmp dir. Harris ----- Original Message ----- From: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx> To: "Amar S. Tumballi" <amar@xxxxxxxxxxxxx> Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx> Sent: Sunday, July 1, 2007 9:46:18 PM (GMT-0500) America/New_York Subject: Re: difficult bug in 2.5 mainline Amar, The rm -rf bug is still there. See the last comment by Daniel to the ml in reply to the problem with rm -rf post to the ml. BTW files are being deleted but at the rate of about 1 every 3 sec with lots of lookups in the logs. I am going to check the other problem now. Harris ----- Original Message ----- From: "Amar S. Tumballi" <amar@xxxxxxxxxxxxx> To: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx> Cc: "gluster-devel" <gluster-devel@xxxxxxxxxx> Sent: Sunday, July 1, 2007 7:55:09 PM (GMT-0500) America/New_York Subject: Re: difficult bug in 2.5 mainline Hi Harris, With the latest patch this bug is fixed. Also, i hope it should fix the problem of 'rm -rf' too.. please confirm. i am looking into other strange bug reported by you. 
----- Original Message -----
From: "Harris Landgarten" <harrisl@xxxxxxxxxxxxx>
To: "gluster-devel" <gluster-devel@xxxxxxxxxx>
Sent: Sunday, July 1, 2007 10:56:05 AM (GMT-0500) America/New_York
Subject: difficult bug in 2.5 mainline

I am trying to track down a bug that is causing hangs in 2.5 patch-249 and all previous patches. It happens during a full Zimbra backup of certain accounts to /mnt/glusterfs/backups. The first stage of the backup copies indexes and primary storage to /mnt/glusterfs/backups/tmp; all of this data resides in local storage, and the writes to gluster succeed. The next stage copies secondary storage to /mnt/glusterfs/backups/tmp, and this fails in the following way:

Brick1 hangs with no errors.
Brick2 hangs with no errors.
The Zimbra client hangs with no errors.
The second client loses connectivity.
The second client bails after 2 minutes but cannot reconnect.
The Zimbra client never bails.
I then restart the bricks.
After both bricks are restarted, the second client reconnects and a hung df -h completes.
The Zimbra client stays in a hung, unconnected state; ls -l /mnt/glusterfs hangs.
The only way to reset is:
kill -9 `pidof glusterfs`
umount /mnt/glusterfs
glusterfs

Post-mortem examination of /mnt/glusterfs/backups/tmp shows that only a few files had been written from the secondary storage volume. In this case over 15,000 files should have been written.

Note: this only happens with large mailboxes containing some large (>10M) files.
Note: with patch-247 the Zimbra client would segfault. With 249 it just hangs in an unrecoverable state.

Harris

--
Amar Tumballi
http://amar.80x25.org
[bulde on #gluster/irc.gnu.org]

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel