Hi Anand,

ulimit -l running as root is 64. This dmesg output is from the second system. I don't see anything new on the first system other than what was there when the system booted. Do you want to see the whole dmesg output? Where should I post it? There are 1600 lines.

...
ling

INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000000     0  8880      1 0x00000080
 ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81141768>] sys_munmap+0x48/0x80
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000000     0  8880      1 0x00000080
 ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81141768>] sys_munmap+0x48/0x80
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000009     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000009     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000003     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9471 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000004     0  9471      1 0x00000080
 ffff8801077c3740 0000000000000082 0000000000000000 ffff8801077c36b8
 ffffffff8127f138 0000000000000000 0000000000000000 ffff8801077c36d8
 ffff8806146f4638 ffff8801077c3fd8 000000000000f4e8 ffff8806146f4638
Call Trace:
 [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
 [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffffa019607a>] ? ixgbe_xmit_frame_ring+0x93a/0xfc0 [ixgbe]
 [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee6f4>] ? down_read+0x24/0x30
 [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
 [<ffffffff81430c38>] ? dev_queue_xmit+0x178/0x6b0
 [<ffffffff8146809c>] ? ip_finish_output+0x13c/0x310
 [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
 [<ffffffff814ef8f5>] page_fault+0x25/0x30
 [<ffffffff81275a6d>] ? copy_user_generic_string+0x2d/0x40
 [<ffffffff81425655>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff81426070>] skb_copy_datagram_iovec+0x60/0x2c0
 [<ffffffff8141ceac>] ? lock_sock_nested+0xac/0xc0
 [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffff814722d5>] tcp_recvmsg+0xca5/0xe90
 [<ffffffff814925ea>] inet_recvmsg+0x5a/0x90
 [<ffffffff8141bff1>] sock_aio_read+0x181/0x190
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff8141be70>] ? sock_aio_read+0x0/0x190
 [<ffffffff8117614b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff811771df>] do_readv_writev+0xcf/0x1f0
 [<ffffffff811b9b50>] ? sys_epoll_wait+0xa0/0x300
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff81177513>] vfs_readv+0x43/0x60
 [<ffffffff81177641>] sys_readv+0x51/0xb0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9545 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000006     0  9545      1 0x00000080
 ffff880c24a7bcf8 0000000000000082 0000000000000000 ffffffff8107c0a0
 ffff88066a0a7580 ffff880c30460000 0000000000000000 0000000000000000
 ffff88066a0a7b38 ffff880c24a7bfd8 000000000000f4e8 ffff88066a0a7b38
Call Trace:
 [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8127f18c>] ? is_swiotlb_buffer+0x3c/0x50
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffffa0211b96>] ib_umem_release+0x76/0x110 [ib_core]
 [<ffffffffa0230d52>] mlx4_ib_dereg_mr+0x32/0x50 [mlx4_ib]
 [<ffffffffa020cd85>] ib_dereg_mr+0x35/0x50 [ib_core]
 [<ffffffffa041bc5b>] ib_uverbs_dereg_mr+0x7b/0xf0 [ib_uverbs]
 [<ffffffffa04194ef>] ib_uverbs_write+0xbf/0xe0 [ib_uverbs]
 [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9546 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000004     0  9546      1 0x00000080
 ffff880c0634bcf0 0000000000000082 ffff880c0634bcb8 ffff880c0634bcb4
 0000000000015f80 ffff88063fc24b00 ffff880655495f80 0000000000000400
 ffff880c2dccc5f8 ffff880c0634bfd8 000000000000f4e8 ffff880c2dccc5f8
Call Trace:
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee6f4>] ? down_read+0x24/0x30
 [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
 [<ffffffffa0419e16>] ? ib_uverbs_event_read+0x1d6/0x240 [ib_uverbs]
 [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
 [<ffffffff814ef8f5>] page_fault+0x25/0x30
INFO: task glusterfsd:9553 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 000000000000000e     0  9553      1 0x00000080
 ffff8806e131dd98 0000000000000082 0000000000000000 ffff8806e131dd64
 ffff8806e131dd48 ffffffffa026dfb6 ffff8806e131dd28 ffffffff00000000
 ffff880c2f41c678 ffff8806e131dfd8 000000000000f4e8 ffff880c2f41c678
Call Trace:
 [<ffffffffa026dfb6>] ? xfs_attr_get+0xb6/0xc0 [xfs]
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81136009>] sys_madvise+0x329/0x760
 [<ffffffff81195740>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000003     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

On 06/08/2012 05:18 PM, Anand Avati wrote:
> Those are 4.x GB. Can you post dmesg output as well? Also, what's
> 'ulimit -l' on your system?
>
> On Fri, Jun 8, 2012 at 4:41 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>
> This is the core file from the crash just now
>
> [root at psanaoss213 /]# ls -al core*
> -rw------- 1 root root 4073594880 Jun  8 15:05 core.22682
>
> From yesterday:
> [root at psanaoss214 /]# ls -al core*
> -rw------- 1 root root 4362727424 Jun  8 00:58 core.13483
> -rw------- 1 root root 4624773120 Jun  8 03:21 core.8792
>
> On 06/08/2012 04:34 PM, Anand Avati wrote:
>> Is it possible the system was running low on memory? I see you
>> have 48GB, but memory registration failure typically would be
>> because the system limit on the number of pinnable pages in RAM
>> was hit. Can you tell us the size of your core dump files after
>> the crash?
>>
>> Avati
>>
>> On Fri, Jun 8, 2012 at 4:22 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>>
>> Hello,
>>
>> I have a brick that crashed twice today, and another different
>> brick that crashed just a while ago.
>>
>> This is what I see in one of the brick logs:
>>
>> patchset: git://git.gluster.com/glusterfs.git
>> signal received: 6
>> time of crash: 2012-06-08 15:05:11
>> configuration details:
>> argp 1
>> backtrace 1
>> dlfcn 1
>> fdatasync 1
>> libpthread 1
>> llistxattr 1
>> setfsid 1
>> spinlock 1
>> epoll.h 1
>> xattr.h 1
>> st_atim.tv_nsec 1
>> package-string: glusterfs 3.2.6
>> /lib64/libc.so.6[0x34bc032900]
>> /lib64/libc.so.6(gsignal+0x35)[0x34bc032885]
>> /lib64/libc.so.6(abort+0x175)[0x34bc034065]
>> /lib64/libc.so.6[0x34bc06f977]
>> /lib64/libc.so.6[0x34bc075296]
>> /opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4]
>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967]
>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32]
>> /opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135]
>> /lib64/libpthread.so.0[0x34bc8077f1]
>> /lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd]
>> ---------
>>
>> And somewhere before these, there is also:
>> [2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] 0-rpc-transport/rdma: memory registration failed
>>
>> I have 48GB of memory on the system:
>>
>> # free
>>              total       used       free     shared    buffers     cached
>> Mem:      49416716   34496648   14920068          0      31692   28209612
>> -/+ buffers/cache:    6255344   43161372
>> Swap:      4194296       1740    4192556
>>
>> # uname -a
>> Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The server gluster version is 3.2.6-1. I am using both rdma and tcp clients over a 10Gb/s network.
>>
>> Any suggestion what I should look for?
>>
>> Is there a way to just restart the brick, and not glusterd, on the server?
>> I have 8 bricks on the server.
>>
>> Thanks,
>> ...
>> ling
>>
>> Here's the volume info:
>>
>> # gluster volume info
>>
>> Volume Name: ana12
>> Type: Distribute
>> Status: Started
>> Number of Bricks: 40
>> Transport-type: tcp,rdma
>> Bricks:
>> Brick1: psanaoss214:/brick1
>> Brick2: psanaoss214:/brick2
>> Brick3: psanaoss214:/brick3
>> Brick4: psanaoss214:/brick4
>> Brick5: psanaoss214:/brick5
>> Brick6: psanaoss214:/brick6
>> Brick7: psanaoss214:/brick7
>> Brick8: psanaoss214:/brick8
>> Brick9: psanaoss211:/brick1
>> Brick10: psanaoss211:/brick2
>> Brick11: psanaoss211:/brick3
>> Brick12: psanaoss211:/brick4
>> Brick13: psanaoss211:/brick5
>> Brick14: psanaoss211:/brick6
>> Brick15: psanaoss211:/brick7
>> Brick16: psanaoss211:/brick8
>> Brick17: psanaoss212:/brick1
>> Brick18: psanaoss212:/brick2
>> Brick19: psanaoss212:/brick3
>> Brick20: psanaoss212:/brick4
>> Brick21: psanaoss212:/brick5
>> Brick22: psanaoss212:/brick6
>> Brick23: psanaoss212:/brick7
>> Brick24: psanaoss212:/brick8
>> Brick25: psanaoss213:/brick1
>> Brick26: psanaoss213:/brick2
>> Brick27: psanaoss213:/brick3
>> Brick28: psanaoss213:/brick4
>> Brick29: psanaoss213:/brick5
>> Brick30: psanaoss213:/brick6
>> Brick31: psanaoss213:/brick7
>> Brick32: psanaoss213:/brick8
>> Brick33: psanaoss215:/brick1
>> Brick34: psanaoss215:/brick2
>> Brick35: psanaoss215:/brick4
>> Brick36: psanaoss215:/brick5
>> Brick37: psanaoss215:/brick7
>> Brick38: psanaoss215:/brick8
>> Brick39: psanaoss215:/brick3
>> Brick40: psanaoss215:/brick6
>> Options Reconfigured:
>> performance.io-thread-count: 16
>> performance.write-behind-window-size: 16MB
>> performance.cache-size: 1GB
>> nfs.disable: on
>> performance.cache-refresh-timeout: 1
>> network.ping-timeout: 42
>> performance.cache-max-file-size: 1PB
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
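Given the "memory registration failed" error from the rdma transport and a locked-memory limit of only 64 KiB, the usual suspect is the memlock ulimit: RDMA must pin registration buffers in RAM, and 64 KiB is exhausted almost immediately. A minimal sketch of checking the limit and the kind of limits.conf entries that would raise it (the exact values and whether limits.conf is the right place on this system are assumptions, not something confirmed in the thread):

```shell
# Show the current max-locked-memory limit in KiB for this shell.
# RDMA transports pin registration buffers, so a value like 64 is
# far too low; 'unlimited' is common for RDMA workloads.
ulimit -l

# Hypothetical /etc/security/limits.conf entries that would raise the
# limit for all users. They only take effect for new login sessions,
# after which glusterd (and the brick processes it spawns) would need
# to be restarted to inherit the new limit.
cat <<'EOF'
*    soft    memlock    unlimited
*    hard    memlock    unlimited
EOF
```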