On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable: peers and bricks may or may not show up in gluster status output. Results can differ from one peer to another, and are not even symmetrical: a peer may see the bricks of another but not the other way around.

After playing a bit, I managed to get a real crash on restarting glusterd on all peers. 3 of them crash here:

Program terminated with signal 11, Segmentation fault.
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
409                     gf_timer_call_cancel (clnt->ctx,
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
#1  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) list
404             if (!trans) {
405                     pthread_mutex_unlock (&conn->lock);
406                     return;
407             }
408             if (conn->reconnect)
409                     gf_timer_call_cancel (clnt->ctx,
410                                           conn->reconnect);
411             conn->reconnect = 0;
412
413             if ((conn->connected == 0) && !clnt->disabled) {
(gdb) print clnt
$1 = (struct rpc_clnt *) 0x39bb
(gdb) print conn
$2 = (rpc_clnt_connection_t *) 0xb9ce5150
(gdb) print conn->lock
$3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = "0Q\316",
  ptm_interlock = 185 '\271', ptm_pad2 = "\336\300\255",
  ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200,
  ptm_spare2 = 0xce513000}

ptm_magic is wrong. NetBSD libpthread sets it to 0x33330003 when the mutex is created and to 0xDEAD0003 when it is destroyed. A value of 51200 (0xc800) matches neither, which means we either have memory corruption, or the mutex was never initialized.
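
For anyone who wants to run the same check on other cores, the magic test described above can be sketched in C. This is only an illustration: classify_ptm_magic and the two macro names are hypothetical and are not part of GlusterFS or libpthread. The two constants are the values quoted above for NetBSD libpthread.

```c
#include <stdint.h>

/* Values quoted in this message for the ptm_magic field of a NetBSD
 * libpthread mutex. The macro names are made up for this sketch. */
#define MUTEX_MAGIC_LIVE 0x33330003u  /* set when the mutex is created */
#define MUTEX_MAGIC_DEAD 0xDEAD0003u  /* set when the mutex is destroyed */

/* Hypothetical helper: describe what a ptm_magic value read from a
 * core dump suggests about the state of the mutex. */
static const char *
classify_ptm_magic (uint32_t magic)
{
        if (magic == MUTEX_MAGIC_LIVE)
                return "initialized";
        if (magic == MUTEX_MAGIC_DEAD)
                return "destroyed";
        /* Any other value: the memory was overwritten, or
         * pthread_mutex_init() never ran on it. */
        return "corrupted or never initialized";
}
```

Fed the 51200 printed by gdb above, the helper takes the last branch, which is the same conclusion as the one reached by hand.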

The last one crashes somewhere else:

Program terminated with signal 11, Segmentation fault.
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
241             if (!ctx->timer) {
(gdb) bt
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
#1  0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24) at timer.c:121
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
#3  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) print ctx
$1 = (glusterfs_ctx_t *) 0x80
(gdb) frame 2
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
409                     gf_timer_call_cancel (clnt->ctx,
(gdb) print clnt
$2 = (struct rpc_clnt *) 0xb9dffd94
(gdb) print clnt->lock.ptm_magic
$3 = 1

Here again, corrupted or not initialized.

I kept the cores for further investigation if this is needed.

-- 
Emmanuel Dreyfus
manu@xxxxxxxxxx