Re: Crash - 2.0.git-2009.06.16

Shehjar Tikoo <shehjart@xxxxxxxxxxx> · Sun, 05 Jul 2009 13:23:18 +0530

Hi

Firstly, a fix for this crash is under review.
See http://patches.gluster.com/patch/672/

Secondly, I saw in the logs you provided that the number of
outstanding/pending requests on a single thread was more than 64.
This could be caused by a large number of concurrent metadata
operations, a large number of files being open at the same
time, or both.

I suggest that you try increasing the number of io-threads
at the client and server to 8 in order to balance the large
number of pending requests over more threads. It might result
in better performance.
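
As a rough sketch, assuming the translator stack described in the
quoted message below, the io-threads section of each volfile might
look like the following. The volume names (iothreads, locks,
distribute) are placeholders and should be replaced with whatever
names your volfiles actually use; the only suggested change is the
thread-count value.

```
# Server volfile (assumed layout: posix -> locks -> iothreads -> protocol/server)
volume iothreads
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

# Client volfile (assumed layout: protocol/client -> distribute -> iothreads -> write-behind)
volume iothreads
  type performance/io-threads
  option thread-count 8
  subvolumes distribute
end-volume
```

The two fragments belong in separate files (server and client), and
the processes need to be restarted for the new volfiles to take effect.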

-Shehjar

NovA wrote:
Hi everybody!

Recently I migrated our small 24-node HPC cluster from GlusterFS
1.3.8 unify to 2.0 distribute. Performance seems to have
increased a lot. Thanks for your work!

I use the following translators. On servers:
posix->locks->iothreads->protocol/server; on clients:
protocol/client->distribute->iothreads->write-behind. The io-threads
translator uses 4 threads, with no autoscaling.

Unfortunately, after the upgrade I've run into new issues. First, I've
noticed very high memory usage. GlusterFS on the head node now uses
737 MB of RES memory and doesn't release it. The memory usage grew
during the migration, while running the command "cd
${namespace_export} && find . | (cd ${distribute_mount} && xargs -d
'\n' stat -c '%n')". Note that the provided
migrate-unify-to-distribute.sh script (with the "execute_on" function)
doesn't work...

The second problem is more important. A client on one of the nodes
crashed today with the following backtrace:

------

Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
#1  0x00002b8039bedc0c in malloc () from /lib64/libc.so.6
#2  0x00002b8039548732 in fop_writev_stub (frame=<value optimized out>,
    fn=0x2b803ab6c160 <iot_writev_wrapper>, fd=0x2aaab001e8a0, vector=0x2aaab0071d50,
    count=<value optimized out>, off=105432, iobref=0x2aaab0082d60) at common-utils.h:166
#3  0x00002b803ab6ec00 in iot_writev (frame=0x4, this=0x6150c0, fd=0x2aaab0082711,
    vector=0x2aaab0083060, count=3, offset=105432, iobref=0x2aaab0082d60)
    at io-threads.c:1212
#4  0x00002b803ad7a3de in wb_sync (frame=0x2aaab0034c40, file=0x2aaaac007280,
    winds=0x7fff717a5450) at write-behind.c:445
#5  0x00002b803ad7a4ff in wb_do_ops (frame=0x2aaab0034c40, file=0x2aaaac007280,
    winds=0x7fff717a5450, unwinds=<value optimized out>, other_requests=0x7fff717a5430)
    at write-behind.c:1579
#6  0x00002b803ad7a617 in wb_process_queue (frame=0x2aaab0034c40, file=0x2aaaac007280,
    flush_all=0 '\0') at write-behind.c:1624
#7  0x00002b803ad7dd81 in wb_sync_cbk (frame=0x2aaab0034c40,
    cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
    stbuf=<value optimized out>) at write-behind.c:338
#8  0x00002b803ab6a1e0 in iot_writev_cbk (frame=0x2aaab00309d0,
    cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
    stbuf=0x7fff717a5590) at io-threads.c:1186
#9  0x00002b803a953aae in dht_writev_cbk (frame=0x63e3e0, cookie=<value optimized out>,
    this=<value optimized out>, op_ret=19, op_errno=0, stbuf=0x7fff717a5590)
    at dht-common.c:1797
#10 0x00002b803a7406e9 in client_write_cbk (frame=0x648a80, hdr=<value optimized out>,
    hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4363
#11 0x00002b803a72c83a in protocol_client_pollin (this=0x60ec70, trans=0x61a380)
    at client-protocol.c:6230
#12 0x00002b803a7370bc in notify (this=0x4, event=<value optimized out>, data=0x61a380)
    at client-protocol.c:6274
#13 0x00002b8039533183 in xlator_notify (xl=0x60ec70, event=2, data=0x61a380)
    at xlator.c:820
#14 0x00002aaaaaaaff0b in socket_event_handler (fd=<value optimized out>, idx=4,
    data=0x61a380, poll_in=1, poll_out=0, poll_err=0) at socket.c:813
#15 0x00002b803954b2aa in event_dispatch_epoll (event_pool=0x6094f0) at event.c:804
#16 0x0000000000403f34 in main (argc=6, argv=0x7fff717a64f8) at glusterfsd.c:1223

----------

Later GlusterFS crashed again with a different backtrace:

----------

Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
Program terminated with signal 6, Aborted.
#0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
#1  0x00002ae6dfcd60e0 in abort () from /lib64/libc.so.6
#2  0x00002ae6dfd0cfbb in ?? () from /lib64/libc.so.6
#3  0x00002ae6dfd1221d in ?? () from /lib64/libc.so.6
#4  0x00002ae6dfd13f76 in free () from /lib64/libc.so.6
#5  0x00002ae6df673efd in mem_put (pool=0x631a90, ptr=0x2aaaac0bc520) at mem-pool.c:191
#6  0x00002ae6e0c992ce in iot_dequeue_ordered (worker=0x631a20) at io-threads.c:2407
#7  0x00002ae6e0c99326 in iot_worker_ordered (arg=<value optimized out>)
    at io-threads.c:2421
#8  0x00002ae6dfa8e020 in start_thread () from /lib64/libpthread.so.0
#9  0x00002ae6dfd68f8d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()

----------

I hope these backtraces help to find the issue...

Best regards,

Andrey

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel