There is a race between gf_timer_call_cancel() and the firing of the timer, which is addressed by [1]. Can this be the cause? Also note that [1] by itself is not sufficient: callers of gf_timer_call_cancel() should check its return value, and must not free the opaque pointer they passed to gf_timer_call_after() at registration time when gf_timer_call_cancel() returns -1 (a toy sketch of that caller-side rule is appended after the quoted mail below). Note that [1] is not in 3.7.3.

[1] http://review.gluster.org/6459

----- Original Message -----
> From: "Emmanuel Dreyfus" <manu@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Sent: Thursday, August 6, 2015 3:29:36 PM
> Subject: SSL enabled glusterd crash
>
> On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable,
> with peers and bricks showing up or not in gluster status outputs.
> And results can be different on different peers, and even not
> symmetrical: a peer sees the bricks of another but not the other
> way around.
>
> After playing a bit, I managed to get a real crash on restarting
> glusterd on all peers. 3 of them crash here:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> 409             gf_timer_call_cancel (clnt->ctx,
> #0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
> #1  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address
>     0xba9fffd8) at timer.c:194
> (gdb) list
> 404             if (!trans) {
> 405                     pthread_mutex_unlock (&conn->lock);
> 406                     return;
> 407             }
> 408             if (conn->reconnect)
> 409                     gf_timer_call_cancel (clnt->ctx,
> 410                                           conn->reconnect);
> 411             conn->reconnect = 0;
> 412
> 413             if ((conn->connected == 0) && !clnt->disabled) {
> (gdb) print clnt
> $1 = (struct rpc_clnt *) 0x39bb
> (gdb) print conn
> $2 = (rpc_clnt_connection_t *) 0xb9ce5150
> (gdb) print conn->lock
> $3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = "0Q\316",
>   ptm_interlock = 185 '\271', ptm_pad2 = "\336\300\255",
>   ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200,
>   ptm_spare2 = 0xce513000}
>
> ptm_magic is wrong. NetBSD libpthread sets it to 0x33330003 when created
> and to 0xDEAD0003 when destroyed. This means we either have memory
> corruption, or the mutex was never initialized.
>
> The last one crashes somewhere else:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> 241     if (!ctx->timer) {
> (gdb) bt
> #0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
> #1  0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24)
>     at timer.c:121
> #2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> #3  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at
>     address 0xba9fffd8) at timer.c:194
> (gdb) print ctx
> $1 = (glusterfs_ctx_t *) 0x80
> (gdb) frame 2
> #2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
> 409             gf_timer_call_cancel (clnt->ctx,
> (gdb) print clnt
> $2 = (struct rpc_clnt *) 0xb9dffd94
> (gdb) print clnt->lock.ptm_magic
> $3 = 1
>
> Here again, corrupted or not initialized.
>
> I kept the cores for further investigation if this is needed.
>
> --
> Emmanuel Dreyfus
> manu@xxxxxxxxxx
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
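
To make the caller-side rule above concrete, here is a toy, self-contained
sketch. It is NOT the real GlusterFS code: toy_timer_*, conn_state and
reconnect_cbk are made-up stand-ins for gf_timer_call_after(),
gf_timer_call_cancel() and the opaque data a caller registers. The only point
is the ownership rule: free the opaque pointer after a successful cancel,
otherwise leave it to the callback.

#include <stdio.h>
#include <stdlib.h>

/* Toy one-shot timer, standing in for the event returned at registration. */
struct toy_timer {
        int consumed;            /* set once it has fired or been cancelled */
        void (*cbk)(void *);     /* callback passed at registration         */
        void *data;              /* opaque pointer passed at registration   */
};

static struct toy_timer *
toy_timer_register(void (*cbk)(void *), void *data)
{
        struct toy_timer *t = calloc(1, sizeof(*t));
        t->cbk = cbk;
        t->data = data;
        return t;
}

/* Stands in for the timer thread firing the event. */
static void
toy_timer_fire(struct toy_timer *t)
{
        if (!t->consumed) {
                t->consumed = 1;
                t->cbk(t->data);
        }
}

/* Returns 0 if the pending event was removed, -1 if it already fired. */
static int
toy_timer_cancel(struct toy_timer *t)
{
        if (t->consumed)
                return -1;
        t->consumed = 1;
        return 0;
}

/* Opaque data of the kind a caller would hand over at registration. */
struct conn_state {
        int port;
};

static void
reconnect_cbk(void *data)
{
        struct conn_state *st = data;
        printf("callback ran for port %d; callback frees the data\n", st->port);
        free(st);                /* once fired, the callback owns the data  */
}

int
main(void)
{
        /* Case 1: cancel wins the race; the caller still owns the data. */
        struct conn_state *st = calloc(1, sizeof(*st));
        st->port = 24007;
        struct toy_timer *ev = toy_timer_register(reconnect_cbk, st);
        if (toy_timer_cancel(ev) == 0)
                free(st);        /* never fired, so freeing here is safe    */
        free(ev);

        /* Case 2: the timer fires first; cancel reports -1 and the caller
         * must NOT free the data again -- doing so is exactly the double
         * free / use-after-free this kind of race can produce. */
        st = calloc(1, sizeof(*st));
        st->port = 24008;
        ev = toy_timer_register(reconnect_cbk, st);
        toy_timer_fire(ev);      /* plays the role of the timer thread      */
        if (toy_timer_cancel(ev) == 0)
                free(st);        /* not reached: cancel returned -1         */
        free(ev);

        return 0;
}

The same ownership rule would have to apply to freeing the rpc_clnt /
connection objects themselves: if the cancel failed, the timer thread may
still be using them, which would be consistent with the corrupted clnt/conn
seen in the backtraces above. That said, this is only an illustration of the
suspected pattern, not an analysis of the cores.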