Hi Anand,

ulimit -l running as root is 64. This dmesg output is from the second system. I don't see anything new on the first system other than what was there when the system booted. Do you want to see the whole dmesg output? Where should I post it? There are 1600 lines.

...
ling

INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000000     0  8880      1 0x00000080
 ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81141768>] sys_munmap+0x48/0x80
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000000     0  8880      1 0x00000080
 ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81141768>] sys_munmap+0x48/0x80
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000009     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000009     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000003     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9471 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000004     0  9471      1 0x00000080
 ffff8801077c3740 0000000000000082 0000000000000000 ffff8801077c36b8
 ffffffff8127f138 0000000000000000 0000000000000000 ffff8801077c36d8
 ffff8806146f4638 ffff8801077c3fd8 000000000000f4e8 ffff8806146f4638
Call Trace:
 [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
 [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffffa019607a>] ? ixgbe_xmit_frame_ring+0x93a/0xfc0 [ixgbe]
 [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee6f4>] ? down_read+0x24/0x30
 [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
 [<ffffffff81430c38>] ? dev_queue_xmit+0x178/0x6b0
 [<ffffffff8146809c>] ? ip_finish_output+0x13c/0x310
 [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
 [<ffffffff814ef8f5>] page_fault+0x25/0x30
 [<ffffffff81275a6d>] ? copy_user_generic_string+0x2d/0x40
 [<ffffffff81425655>] ? memcpy_toiovec+0x55/0x80
 [<ffffffff81426070>] skb_copy_datagram_iovec+0x60/0x2c0
 [<ffffffff8141ceac>] ? lock_sock_nested+0xac/0xc0
 [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
 [<ffffffff814722d5>] tcp_recvmsg+0xca5/0xe90
 [<ffffffff814925ea>] inet_recvmsg+0x5a/0x90
 [<ffffffff8141bff1>] sock_aio_read+0x181/0x190
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
 [<ffffffff8141be70>] ? sock_aio_read+0x0/0x190
 [<ffffffff8117614b>] do_sync_readv_writev+0xfb/0x140
 [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
 [<ffffffff811771df>] do_readv_writev+0xcf/0x1f0
 [<ffffffff811b9b50>] ? sys_epoll_wait+0xa0/0x300
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff81177513>] vfs_readv+0x43/0x60
 [<ffffffff81177641>] sys_readv+0x51/0xb0
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9545 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000006     0  9545      1 0x00000080
 ffff880c24a7bcf8 0000000000000082 0000000000000000 ffffffff8107c0a0
 ffff88066a0a7580 ffff880c30460000 0000000000000000 0000000000000000
 ffff88066a0a7b38 ffff880c24a7bfd8 000000000000f4e8 ffff88066a0a7b38
Call Trace:
 [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff8127f18c>] ? is_swiotlb_buffer+0x3c/0x50
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffffa0211b96>] ib_umem_release+0x76/0x110 [ib_core]
 [<ffffffffa0230d52>] mlx4_ib_dereg_mr+0x32/0x50 [mlx4_ib]
 [<ffffffffa020cd85>] ib_dereg_mr+0x35/0x50 [ib_core]
 [<ffffffffa041bc5b>] ib_uverbs_dereg_mr+0x7b/0xf0 [ib_uverbs]
 [<ffffffffa04194ef>] ib_uverbs_write+0xbf/0xe0 [ib_uverbs]
 [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
 [<ffffffff81176588>] vfs_write+0xb8/0x1a0
 [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff81176f91>] sys_write+0x51/0x90
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfsd:9546 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 0000000000000004     0  9546      1 0x00000080
 ffff880c0634bcf0 0000000000000082 ffff880c0634bcb8 ffff880c0634bcb4
 0000000000015f80 ffff88063fc24b00 ffff880655495f80 0000000000000400
 ffff880c2dccc5f8 ffff880c0634bfd8 000000000000f4e8 ffff880c2dccc5f8
Call Trace:
 [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
 [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ee6f4>] ? down_read+0x24/0x30
 [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
 [<ffffffffa0419e16>] ? ib_uverbs_event_read+0x1d6/0x240 [ib_uverbs]
 [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
 [<ffffffff814ef8f5>] page_fault+0x25/0x30
INFO: task glusterfsd:9553 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfsd    D 000000000000000e     0  9553      1 0x00000080
 ffff8806e131dd98 0000000000000082 0000000000000000 ffff8806e131dd64
 ffff8806e131dd48 ffffffffa026dfb6 ffff8806e131dd28 ffffffff00000000
 ffff880c2f41c678 ffff8806e131dfd8 000000000000f4e8 ffff880c2f41c678
Call Trace:
 [<ffffffffa026dfb6>] ? xfs_attr_get+0xb6/0xc0 [xfs]
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81136009>] sys_madvise+0x329/0x760
 [<ffffffff81195740>] ? mntput_no_expire+0x30/0x110
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
INFO: task glusterfs:8880 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
glusterfs     D 0000000000000003     0  8880      1 0x00000080
 ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
 ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
 ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
Call Trace:
 [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
 [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
 [<ffffffff814ee6c2>] ? down_write+0x32/0x40
 [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
 [<ffffffff81010469>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

On 06/08/2012 05:18 PM, Anand Avati wrote:
> Those are 4.x GB. Can you post dmesg output as well? Also, what's
> 'ulimit -l' on your system?
>
> On Fri, Jun 8, 2012 at 4:41 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>
> This is the core file from the crash just now
>
> [root at psanaoss213 /]# ls -al core*
> -rw------- 1 root root 4073594880 Jun  8 15:05 core.22682
>
> From yesterday:
> [root at psanaoss214 /]# ls -al core*
> -rw------- 1 root root 4362727424 Jun  8 00:58 core.13483
> -rw------- 1 root root 4624773120 Jun  8 03:21 core.8792
>
> On 06/08/2012 04:34 PM, Anand Avati wrote:
>> Is it possible the system was running low on memory? I see you
>> have 48GB, but memory registration failure typically would be
>> because the system limit on the number of pinnable pages in RAM
>> was hit. Can you tell us the size of your core dump files after
>> the crash?
>>
>> Avati
>>
>> On Fri, Jun 8, 2012 at 4:22 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>>
>> Hello,
>>
>> I have a brick that crashed twice today, and another different
>> brick that crashed just a while ago.
>>
>> This is what I see in one of the brick logs:
>>
>> patchset: git://git.gluster.com/glusterfs.git
>> signal received: 6
>> time of crash: 2012-06-08 15:05:11
>> configuration details:
>> argp 1
>> backtrace 1
>> dlfcn 1
>> fdatasync 1
>> libpthread 1
>> llistxattr 1
>> setfsid 1
>> spinlock 1
>> epoll.h 1
>> xattr.h 1
>> st_atim.tv_nsec 1
>> package-string: glusterfs 3.2.6
>> /lib64/libc.so.6[0x34bc032900]
>> /lib64/libc.so.6(gsignal+0x35)[0x34bc032885]
>> /lib64/libc.so.6(abort+0x175)[0x34bc034065]
>> /lib64/libc.so.6[0x34bc06f977]
>> /lib64/libc.so.6[0x34bc075296]
>> /opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4]
>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967]
>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32]
>> /opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135]
>> /lib64/libpthread.so.0[0x34bc8077f1]
>> /lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd]
>> ---------
>>
>> And somewhere before these, there is also:
>> [2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] 0-rpc-transport/rdma: memory registration failed
>>
>> I have 48GB of memory on the system:
>>
>> # free
>>              total       used       free     shared    buffers     cached
>> Mem:      49416716   34496648   14920068          0      31692   28209612
>> -/+ buffers/cache:    6255344   43161372
>> Swap:      4194296       1740    4192556
>>
>> # uname -a
>> Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>
>> The server gluster version is 3.2.6-1. I am using both rdma and tcp clients over a 10Gb/s network.
>>
>> Any suggestion what I should look for?
>>
>> Is there a way to just restart the brick, and not glusterd, on the server?
>> I have 8 bricks on the server.
>>
>> Thanks,
>> ...
>> ling
>>
>> Here's the volume info:
>>
>> # gluster volume info
>>
>> Volume Name: ana12
>> Type: Distribute
>> Status: Started
>> Number of Bricks: 40
>> Transport-type: tcp,rdma
>> Bricks:
>> Brick1: psanaoss214:/brick1
>> Brick2: psanaoss214:/brick2
>> Brick3: psanaoss214:/brick3
>> Brick4: psanaoss214:/brick4
>> Brick5: psanaoss214:/brick5
>> Brick6: psanaoss214:/brick6
>> Brick7: psanaoss214:/brick7
>> Brick8: psanaoss214:/brick8
>> Brick9: psanaoss211:/brick1
>> Brick10: psanaoss211:/brick2
>> Brick11: psanaoss211:/brick3
>> Brick12: psanaoss211:/brick4
>> Brick13: psanaoss211:/brick5
>> Brick14: psanaoss211:/brick6
>> Brick15: psanaoss211:/brick7
>> Brick16: psanaoss211:/brick8
>> Brick17: psanaoss212:/brick1
>> Brick18: psanaoss212:/brick2
>> Brick19: psanaoss212:/brick3
>> Brick20: psanaoss212:/brick4
>> Brick21: psanaoss212:/brick5
>> Brick22: psanaoss212:/brick6
>> Brick23: psanaoss212:/brick7
>> Brick24: psanaoss212:/brick8
>> Brick25: psanaoss213:/brick1
>> Brick26: psanaoss213:/brick2
>> Brick27: psanaoss213:/brick3
>> Brick28: psanaoss213:/brick4
>> Brick29: psanaoss213:/brick5
>> Brick30: psanaoss213:/brick6
>> Brick31: psanaoss213:/brick7
>> Brick32: psanaoss213:/brick8
>> Brick33: psanaoss215:/brick1
>> Brick34: psanaoss215:/brick2
>> Brick35: psanaoss215:/brick4
>> Brick36: psanaoss215:/brick5
>> Brick37: psanaoss215:/brick7
>> Brick38: psanaoss215:/brick8
>> Brick39: psanaoss215:/brick3
>> Brick40: psanaoss215:/brick6
>> Options Reconfigured:
>> performance.io-thread-count: 16
>> performance.write-behind-window-size: 16MB
>> performance.cache-size: 1GB
>> nfs.disable: on
>> performance.cache-refresh-timeout: 1
>> network.ping-timeout: 42
>> performance.cache-max-file-size: 1PB
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
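Given the "memory registration failed" error from the rdma transport and a locked-memory limit of only 64 KiB, the usual suspect is the memlock ulimit: RDMA must pin registration buffers in RAM, and 64 KiB is exhausted almost immediately. A minimal sketch of checking the limit and the kind of limits.conf entries that would raise it (the exact values and whether limits.conf is the right place on this system are assumptions, not something confirmed in the thread):

```shell
# Show the current max-locked-memory limit in KiB for this shell.
# RDMA transports pin registration buffers, so a value like 64 is
# far too low; 'unlimited' is common for RDMA workloads.
ulimit -l

# Hypothetical /etc/security/limits.conf entries that would raise the
# limit for all users. They only take effect for new login sessions,
# after which glusterd (and the brick processes it spawns) would need
# to be restarted to inherit the new limit.
cat <<'EOF'
*    soft    memlock    unlimited
*    hard    memlock    unlimited
EOF
```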