On 12/03/2014 07:36 PM, Justin Clift wrote:
> On Tue, 02 Dec 2014 10:05:36 +0530
> Atin Mukherjee <amukherj@xxxxxxxxxx> wrote:
> <snip>
>> It's on my radar; I am in the process of analysing it. The last patch
>> set was about cleaning up the test cases, and I felt those changes
>> might have solved it, but I am afraid they didn't. I tried to
>> reproduce it multiple times in my local setup but couldn't. Initial
>> analysis shows that one of the glusterd nodes got disconnected while
>> multiple volume set transactions were in progress; the reason is still
>> unknown. I will keep you posted once I find any significant details.
>
> The worrying thing (to me), is that this could be a bug that's happening
> for people in real world usage, and not just something happening in our
> regression testing.

Agreed, Justin. It does look like a real bug, and I have found something significant now. I wrote a small script that executes two volume set commands in parallel (on two different volumes) in a loop of 1000 iterations, and glusterd hung after a few iterations. The glusterd process was not dead, but it did not respond at all.
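For reference, the reproducer was along these lines. This is a sketch, not the exact script: the iteration count matches the run described above, but the volume names and the option being set are placeholders, and the two commands default to no-ops so the loop structure can be exercised on a machine without gluster installed.

```shell
#!/bin/sh
# Sketch of the reproducer: run two "gluster volume set" commands in
# parallel, one per volume, and wait for both, 1000 times over.
# CMD1/CMD2 default to harmless no-ops; on a real setup they would be
# something like:
#   CMD1="gluster volume set vol1 <option> <value>"
#   CMD2="gluster volume set vol2 <option> <value>"
ITER=${ITER:-1000}
CMD1=${CMD1:-true}
CMD2=${CMD2:-true}

i=0
while [ "$i" -lt "$ITER" ]; do
    $CMD1 &        # volume set transaction on the first volume
    $CMD2 &        # concurrent volume set transaction on the second
    wait           # both must return before the next iteration starts
    i=$((i + 1))
done
echo "completed $i iterations"
```

With the real gluster commands substituted in, the loop typically does not reach 1000: after a few iterations one of the CLI invocations blocks forever because glusterd has stopped responding.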
Following is the backtrace:

Thread 6 (Thread 0x7f6f6ecf5700 (LWP 30417)):
#0  0x00007f6f76d67fbd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f6f77a04014 in gf_timer_proc (ctx=0x1396010) at timer.c:170
#2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f6f6e4f4700 (LWP 30418)):
#0  0x00007f6f76d684f1 in sigwait () from /lib64/libpthread.so.0
#1  0x00000000004079e7 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:1728
#2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f6f6dcf3700 (LWP 30419)):
#0  0x00007f6f76d6759d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f6f76d63194 in _L_lock_874 () from /lib64/libpthread.so.0
#2  0x00007f6f76d63093 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f6f766e2612 in _dl_addr () from /lib64/libc.so.6
#4  0x00007f6f766b67d5 in backtrace_symbols_fd () from /lib64/libc.so.6
#5  0x00007f6f779fdc88 in gf_backtrace_fillframes (buf=buf@entry=0x13eb280 "(--> /usr/local/lib/libglusterfs.so.0(synctask_yield+0x2c)[0x7f6f77a21cac] (--> /usr/local/lib/libglusterfs.so.0(+0x57dc9)[0x7f6f77a21dc9] (--> /usr/local/lib/libglusterfs.so.0(synclock_lock+0x16)[0x7"...) at common-utils.c:3335
#6  0x00007f6f77a03631 in gf_backtrace_save (buf=buf@entry=0x13eb280 "(--> /usr/local/lib/libglusterfs.so.0(synctask_yield+0x2c)[0x7f6f77a21cac] (--> /usr/local/lib/libglusterfs.so.0(+0x57dc9)[0x7f6f77a21dc9] (--> /usr/local/lib/libglusterfs.so.0(synclock_lock+0x16)[0x7"...) at common-utils.c:3391
#7  0x00007f6f77a21cac in synctask_yield (task=task@entry=0x13eadf0) at syncop.c:296
#8  0x00007f6f77a21dc9 in __synclock_lock (lock=lock@entry=0x13e0138) at syncop.c:760
#9  0x00007f6f77a24586 in synclock_lock (lock=lock@entry=0x13e0138) at syncop.c:784
#10 0x00007f6f6ca3242d in glusterd_big_locked_cbk (req=0x13e34ec, iov=0x0, count=0, myframe=0x7f6f759ff24c, fn=0x7f6f6ca766d0 <_gd_syncop_stage_op_cbk>) at glusterd-rpc-ops.c:207
#11 0x00007f6f777be9ff in rpc_clnt_submit (rpc=rpc@entry=0x13e3260, prog=prog@entry=0x7f6f6cce9ca0 <gd_mgmt_prog>, procnum=procnum@entry=3, cbkfn=cbkfn@entry=0x7f6f6ca75330 <gd_syncop_stage_op_cbk>, proghdr=proghdr@entry=0x17eb970, proghdrcount=proghdrcount@entry=1, progpayload=progpayload@entry=0x0, progpayloadcount=progpayloadcount@entry=0, iobref=iobref@entry=0x7f6f4bd4a9c0, frame=frame@entry=0x7f6f759ff24c, rsphdr=rsphdr@entry=0x0, rsphdr_count=rsphdr_count@entry=0, rsp_payload=rsp_payload@entry=0x0, rsp_payload_count=rsp_payload_count@entry=0, rsp_iobref=rsp_iobref@entry=0x0) at rpc-clnt.c:1601
#12 0x00007f6f6ca76470 in gd_syncop_submit_request (rpc=0x13e3260, req=req@entry=0x7f6f60239830, local=local@entry=0x17ebaa0, cookie=cookie@entry=0x13e2770, prog=0x7f6f6cce9ca0 <gd_mgmt_prog>, procnum=procnum@entry=3, cbkfn=cbkfn@entry=0x7f6f6ca75330 <gd_syncop_stage_op_cbk>, xdrproc=0x7f6f775a4910 <xdr_gd1_mgmt_stage_op_req>) at glusterd-syncop.c:198
#13 0x00007f6f6ca773d6 in gd_syncop_mgmt_stage_op (peerinfo=peerinfo@entry=0x13e2770, args=args@entry=0x17ebaa0, my_uuid=my_uuid@entry=0x13deff8 "y\205mKLMF\340\241\366\340\037#\227$x/var/lib/glusterd", recv_uuid=recv_uuid@entry=0x17eba90 "", op=op@entry=11, dict_out=dict_out@entry=0x7f6f754045e4, op_ctx=op_ctx@entry=0x7f6f754045e4) at glusterd-syncop.c:749
#14 0x00007f6f6ca77d4b in gd_stage_op_phase (peers=peers@entry=0x13defe0, op=<optimized out>, op_ctx=op_ctx@entry=0x7f6f754045e4, req_dict=0x7f6f754045e4, op_errstr=op_errstr@entry=0x17ec2a8, npeers=npeers@entry=1) at glusterd-syncop.c:1169
#15 0x00007f6f6ca790b8 in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7f6f754045e4, req=req@entry=0x13da238) at glusterd-syncop.c:1619
#16 0x00007f6f6ca7928c in glusterd_op_begin_synctask (req=req@entry=0x13da238, op=op@entry=GD_OP_SET_VOLUME, dict=0x7f6f754045e4) at glusterd-syncop.c:1680
#17 0x00007f6f6c9e52e6 in __glusterd_handle_set_volume (req=req@entry=0x13da238) at glusterd-handler.c:1762
#18 0x00007f6f6c9f1d60 in glusterd_big_locked_handler (req=0x13da238, actor_fn=0x7f6f6c9e4dd0 <__glusterd_handle_set_volume>) at glusterd-handler.c:79
#19 0x00007f6f77a21d32 in synctask_wrap (old_task=<optimized out>) at syncop.c:335
#20 0x00007f6f765faa00 in ?? () from /lib64/libc.so.6
#21 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f6f6d4f2700 (LWP 30420)):
#0  0x00007f6f76d650c9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6f77a237ef in syncenv_task (proc=proc@entry=0x13bfe90) at syncop.c:535
#2  0x00007f6f77a24370 in syncenv_processor (thdata=0x13bfe90) at syncop.c:613
#3  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6f766a7ded in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f6f6564a700 (LWP 30546)):
#0  0x00007f6f76d64d20 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f6f6ca7a183 in hooks_worker (args=<optimized out>) at glusterd-hooks.c:502
#2  0x00007f6f76d60f33 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f6f766a7ded in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f6f77e7c740 (LWP 30416)):
#0  0x00007f6f766e26bf in _dl_addr () from /lib64/libc.so.6
#1  0x00007f6f766b67d5 in backtrace_symbols_fd () from /lib64/libc.so.6
#2  0x00007f6f779fdc88 in gf_backtrace_fillframes (buf=buf@entry=0x1396378 "(--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15c)[0x7f6f779ea1bc] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x1c2)[0x7f6f777bd942] (--> /usr/local/lib/libgfrpc.so.0(saved_fram"...) at common-utils.c:3335
#3  0x00007f6f77a03631 in gf_backtrace_save (buf=buf@entry=0x0) at common-utils.c:3391
#4  0x00007f6f779ea1bc in _gf_log_callingfn (domain=0x13e62b0 "management", file=file@entry=0x7f6f777c4746 "rpc-clnt.c", function=function@entry=0x7f6f777c4ff0 <__FUNCTION__.13438> "saved_frames_unwind", line=line@entry=362, level=level@entry=GF_LOG_ERROR, fmt=fmt@entry=0x7f6f777c4930 "forced unwinding frame type(%s) op(%s(%d)) called at %s (xid=0x%x)") at logging.c:822
#5  0x00007f6f777bd942 in saved_frames_unwind (saved_frames=saved_frames@entry=0x13e71e0) at rpc-clnt.c:353
#6  0x00007f6f777bda3e in saved_frames_destroy (frames=0x13e71e0) at rpc-clnt.c:383
#7  0x00007f6f777bf22c in rpc_clnt_connection_cleanup (conn=conn@entry=0x13e3290) at rpc-clnt.c:536
#8  0x00007f6f777bf9c8 in rpc_clnt_notify (trans=<optimized out>, mydata=0x13e3290, event=<optimized out>, data=<optimized out>) at rpc-clnt.c:843
#9  0x00007f6f777bbd33 in rpc_transport_notify (this=this@entry=0x13e62f0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x13e62f0) at rpc-transport.c:516
#10 0x00007f6f66850962 in socket_event_poll_err (this=0x13e62f0) at socket.c:1161
#11 socket_event_handler (fd=<optimized out>, idx=3, data=data@entry=0x13e62f0, poll_in=1, poll_out=4, poll_err=<optimized out>) at socket.c:2353
#12 0x00007f6f77a3dd82 in event_dispatch_epoll_handler (i=<optimized out>, events=0x13e0700, event_pool=0x13b43c0) at event-epoll.c:384
#13 event_dispatch_epoll (event_pool=0x13b43c0) at event-epoll.c:445
#14 0x0000000000404e89 in main (argc=1, argv=0x7fff7a929408) at glusterfsd.c:2046

From the backtrace, the problematic area appears to be Threads 1 and 4, most likely a deadlock: both are inside backtrace_symbols_fd()/_dl_addr(), and Thread 4 is blocked on a mutex there. This needs further analysis. I will be in a training for the entire week, so I will jump into it early next week.
~Atin

> Until we've figured it out / understand the root cause, we don't know.
>
> :(
>
> + Justin

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel