Brick crashes

ling at slac.stanford.edu (Ling Ho) · Fri, 08 Jun 2012 16:22:59 -0700

Hello,

I have a brick that crashed twice today, and a different brick 
that crashed just a while ago.

This is what I see in one of the brick logs:

patchset: git://git.gluster.com/glusterfs.git
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
signal received: 6
time of crash: 2012-06-08 15:05:11
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.6
/lib64/libc.so.6[0x34bc032900]
/lib64/libc.so.6(gsignal+0x35)[0x34bc032885]
/lib64/libc.so.6(abort+0x175)[0x34bc034065]
/lib64/libc.so.6[0x34bc06f977]
/lib64/libc.so.6[0x34bc075296]
/opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4]
/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967]
/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32]
/opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135]
/lib64/libpthread.so.0[0x34bc8077f1]
/lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd]
---------

And somewhere before these, there is also:
[2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] 0-rpc-transport/rdma: memory registration failed
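
To check whether the same error precedes every crash, the brick log can be scanned for each "time of crash" marker together with the E-level lines logged shortly before it. The following is only a minimal sketch: the function name `find_crashes`, the `context` parameter, and the regular expressions are illustrative assumptions based on the log excerpt above, not part of GlusterFS, and it assumes each log entry sits on a single line in the actual log file.

```python
import re
from collections import deque

# Patterns inferred from the excerpt above (assumption: one entry per line).
ERROR_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] E \[(?P<src>[^\]]+)\] (?P<msg>.*)$")
CRASH_RE = re.compile(r"^time of crash: (?P<ts>.+)$")


def find_crashes(lines, context=5):
    """Return a list of (crash_time, recent_errors) tuples.

    recent_errors holds up to `context` E-level entries seen before the
    crash marker, each as a (timestamp, source, message) tuple.
    """
    recent = deque(maxlen=context)
    crashes = []
    for raw in lines:
        line = raw.rstrip("\n")
        err = ERROR_RE.match(line)
        if err:
            recent.append((err.group("ts"), err.group("src"), err.group("msg")))
            continue
        crash = CRASH_RE.match(line)
        if crash:
            crashes.append((crash.group("ts").strip(), list(recent)))
    return crashes


# Sample input built from the lines quoted in this message.
sample = [
    "[2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] "
    "0-rpc-transport/rdma: memory registration failed",
    "patchset: git://git.gluster.com/glusterfs.git",
    "signal received: 6",
    "time of crash: 2012-06-08 15:05:11",
]

crashes = find_crashes(sample)
for crash_time, errors in crashes:
    print("crash at", crash_time)
    for ts, src, msg in errors:
        print("  preceded by:", ts, src, msg)
```

On a real log, the same function could be fed an open file object in place of `sample`; the path of the brick log depends on the installation and is not assumed here.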

I have 48GB of memory on the system:

# free
              total       used       free     shared    buffers     cached
Mem:      49416716   34496648   14920068          0      31692   28209612
-/+ buffers/cache:    6255344   43161372
Swap:      4194296       1740    4192556
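
As a cross-check of the numbers above, the "-/+ buffers/cache" row can be recomputed from the "Mem:" row: buffers and page cache are reclaimable, so they are subtracted from "used" and added to "free". This is a small sketch under the assumption that `free` reports KiB (its default) and uses the older six-column layout shown here; the helper names are hypothetical.

```python
def parse_mem_line(line):
    """Parse the 'Mem:' row of the older six-column `free` output (values in KiB)."""
    keys = ("total", "used", "free", "shared", "buffers", "cached")
    fields = line.split()
    return dict(zip(keys, (int(v) for v in fields[1:7])))


def effective_memory(mem):
    """Recompute the '-/+ buffers/cache' row from a parsed 'Mem:' row."""
    reclaimable = mem["buffers"] + mem["cached"]
    return {
        "used_by_apps": mem["used"] - reclaimable,
        "available": mem["free"] + reclaimable,
    }


mem = parse_mem_line(
    "Mem:      49416716   34496648   14920068          0      31692   28209612"
)
eff = effective_memory(mem)
print(eff["used_by_apps"], eff["available"])  # 6255344 43161372, matching the row above
```

By this arithmetic, most of the "used" memory is page cache, so the failure does not look like plain memory exhaustion, although that alone does not rule out other limits on RDMA memory registration.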

# uname -a
Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

The server gluster version is 3.2.6-1. I have both rdma 
clients and tcp clients over a 10Gb/s network.

Any suggestions on what I should look for?

Is there a way to just restart the brick, and not glusterd on the 
server? I have 8 bricks on the server.

Thanks,
...
ling

Here's the volume info:

# gluster volume info

Volume Name: ana12
Type: Distribute
Status: Started
Number of Bricks: 40
Transport-type: tcp,rdma
Bricks:
Brick1: psanaoss214:/brick1
Brick2: psanaoss214:/brick2
Brick3: psanaoss214:/brick3
Brick4: psanaoss214:/brick4
Brick5: psanaoss214:/brick5
Brick6: psanaoss214:/brick6
Brick7: psanaoss214:/brick7
Brick8: psanaoss214:/brick8
Brick9: psanaoss211:/brick1
Brick10: psanaoss211:/brick2
Brick11: psanaoss211:/brick3
Brick12: psanaoss211:/brick4
Brick13: psanaoss211:/brick5
Brick14: psanaoss211:/brick6
Brick15: psanaoss211:/brick7
Brick16: psanaoss211:/brick8
Brick17: psanaoss212:/brick1
Brick18: psanaoss212:/brick2
Brick19: psanaoss212:/brick3
Brick20: psanaoss212:/brick4
Brick21: psanaoss212:/brick5
Brick22: psanaoss212:/brick6
Brick23: psanaoss212:/brick7
Brick24: psanaoss212:/brick8
Brick25: psanaoss213:/brick1
Brick26: psanaoss213:/brick2
Brick27: psanaoss213:/brick3
Brick28: psanaoss213:/brick4
Brick29: psanaoss213:/brick5
Brick30: psanaoss213:/brick6
Brick31: psanaoss213:/brick7
Brick32: psanaoss213:/brick8
Brick33: psanaoss215:/brick1
Brick34: psanaoss215:/brick2
Brick35: psanaoss215:/brick4
Brick36: psanaoss215:/brick5
Brick37: psanaoss215:/brick7
Brick38: psanaoss215:/brick8
Brick39: psanaoss215:/brick3
Brick40: psanaoss215:/brick6
Options Reconfigured:
performance.io-thread-count: 16
performance.write-behind-window-size: 16MB
performance.cache-size: 1GB
nfs.disable: on
performance.cache-refresh-timeout: 1
network.ping-timeout: 42
performance.cache-max-file-size: 1PB