Crash - 2.0.git-2009.06.16




Hi everybody!



Recently I migrated our small 24-node HPC cluster from GlusterFS
1.3.8 unify to 2.0 distribute. Performance seems to have improved
a lot. Thanks for your work!

I use the following translators. On the servers:
posix->locks->iothreads->protocol/server; on the clients:
protocol/client->distribute->iothreads->write-behind. The io-threads
translator uses 4 threads, with NO autoscaling.
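
For readers unfamiliar with the layout, the client-side stack described above would look roughly like this in a volfile. This is only a sketch: the volume names, the host names (node01, node02) and the remote subvolume name are placeholders, only two bricks are shown, and autoscaling is left unset rather than spelled out as an option.

```
# client.vol (sketch; volume names and hosts are placeholders)
volume node01
  type protocol/client
  option transport-type tcp
  option remote-host node01
  option remote-subvolume brick
end-volume

volume node02
  type protocol/client
  option transport-type tcp
  option remote-host node02
  option remote-subvolume brick
end-volume

# distribute sits directly on top of the protocol/client volumes
volume distribute
  type cluster/distribute
  subvolumes node01 node02
end-volume

# 4 worker threads, autoscaling not enabled
volume iothreads
  type performance/io-threads
  option thread-count 4
  subvolumes distribute
end-volume

# write-behind is the topmost translator on the client
volume writebehind
  type performance/write-behind
  subvolumes iothreads
end-volume
```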



Unfortunately, the upgrade brought new issues. First, I've noticed
very high memory usage: GlusterFS on the head node now uses 737 MB of
RES memory and doesn't release it. The memory usage grew during the
migration, while running the command "cd
${namespace_export} && find . | (cd ${distribute_mount} && xargs -d
'\n' stat -c '%n')". Note that the provided migrate-unify-to-distribute.sh
script (with the "execute_on" function) doesn't work...
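
The traversal command quoted above can be wrapped in a small shell function, which makes it easier to rerun while watching memory usage. This is a sketch, not the official migration script: the function name stat_through_mount is made up, the two arguments stand in for ${namespace_export} and ${distribute_mount}, and it swaps the newline delimiter for NUL bytes so that unusual file names survive. It assumes GNU find, xargs and stat.

```shell
#!/bin/sh
# Hypothetical helper: list every path under the old unify namespace export
# and stat the same relative path through the new distribute mount.
#   $1 - namespace export directory (stand-in for ${namespace_export})
#   $2 - distribute mount point     (stand-in for ${distribute_mount})
stat_through_mount() {
    namespace_export=$1
    distribute_mount=$2
    # find prints NUL-terminated relative paths; xargs -0 passes them to stat,
    # which runs inside the distribute mount and prints each name it resolved.
    (cd "$namespace_export" && find . -print0) |
        (cd "$distribute_mount" && xargs -0 stat -c '%n')
}
```

For example, "stat_through_mount /data/export-ns /home > /dev/null" would walk the namespace and stat each entry under /home; both paths here are illustrative only.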



The second problem is more important. A client on one of the nodes
crashed today with the following backtrace:

------
Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
#1  0x00002b8039bedc0c in malloc () from /lib64/libc.so.6
#2  0x00002b8039548732 in fop_writev_stub (frame=<value optimized out>,
    fn=0x2b803ab6c160 <iot_writev_wrapper>, fd=0x2aaab001e8a0, vector=0x2aaab0071d50,
    count=<value optimized out>, off=105432, iobref=0x2aaab0082d60) at common-utils.h:166
#3  0x00002b803ab6ec00 in iot_writev (frame=0x4, this=0x6150c0, fd=0x2aaab0082711,
    vector=0x2aaab0083060, count=3, offset=105432, iobref=0x2aaab0082d60)
    at io-threads.c:1212
#4  0x00002b803ad7a3de in wb_sync (frame=0x2aaab0034c40, file=0x2aaaac007280,
    winds=0x7fff717a5450) at write-behind.c:445
#5  0x00002b803ad7a4ff in wb_do_ops (frame=0x2aaab0034c40, file=0x2aaaac007280,
    winds=0x7fff717a5450, unwinds=<value optimized out>, other_requests=0x7fff717a5430)
    at write-behind.c:1579
#6  0x00002b803ad7a617 in wb_process_queue (frame=0x2aaab0034c40, file=0x2aaaac007280,
    flush_all=0 '\0') at write-behind.c:1624
#7  0x00002b803ad7dd81 in wb_sync_cbk (frame=0x2aaab0034c40,
    cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
    stbuf=<value optimized out>) at write-behind.c:338
#8  0x00002b803ab6a1e0 in iot_writev_cbk (frame=0x2aaab00309d0,
    cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
    stbuf=0x7fff717a5590) at io-threads.c:1186
#9  0x00002b803a953aae in dht_writev_cbk (frame=0x63e3e0, cookie=<value optimized out>,
    this=<value optimized out>, op_ret=19, op_errno=0, stbuf=0x7fff717a5590)
    at dht-common.c:1797
#10 0x00002b803a7406e9 in client_write_cbk (frame=0x648a80, hdr=<value optimized out>,
    hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4363
#11 0x00002b803a72c83a in protocol_client_pollin (this=0x60ec70, trans=0x61a380)
    at client-protocol.c:6230
#12 0x00002b803a7370bc in notify (this=0x4, event=<value optimized out>, data=0x61a380)
    at client-protocol.c:6274
#13 0x00002b8039533183 in xlator_notify (xl=0x60ec70, event=2, data=0x61a380)
    at xlator.c:820
#14 0x00002aaaaaaaff0b in socket_event_handler (fd=<value optimized out>, idx=4,
    data=0x61a380, poll_in=1, poll_out=0, poll_err=0) at socket.c:813
#15 0x00002b803954b2aa in event_dispatch_epoll (event_pool=0x6094f0) at event.c:804
#16 0x0000000000403f34 in main (argc=6, argv=0x7fff717a64f8) at glusterfsd.c:1223
----------



Later GlusterFS crashed again with a different backtrace:

----------
Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
Program terminated with signal 6, Aborted.
#0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
#1  0x00002ae6dfcd60e0 in abort () from /lib64/libc.so.6
#2  0x00002ae6dfd0cfbb in ?? () from /lib64/libc.so.6
#3  0x00002ae6dfd1221d in ?? () from /lib64/libc.so.6
#4  0x00002ae6dfd13f76 in free () from /lib64/libc.so.6
#5  0x00002ae6df673efd in mem_put (pool=0x631a90, ptr=0x2aaaac0bc520) at mem-pool.c:191
#6  0x00002ae6e0c992ce in iot_dequeue_ordered (worker=0x631a20) at io-threads.c:2407
#7  0x00002ae6e0c99326 in iot_worker_ordered (arg=<value optimized out>)
    at io-threads.c:2421
#8  0x00002ae6dfa8e020 in start_thread () from /lib64/libpthread.so.0
#9  0x00002ae6dfd68f8d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
----------



I hope these backtraces help to find the issue...



Best regards,

  Andrey



