Emmanuel Dreyfus <manu@xxxxxxxxxx> wrote: > Is it an event-poll story? When the backtrace hits even-pool.c, > glusterfs has already decided to destroy the transport because its > refcount droped to zero. It looks more like a refcount tracking issue, > which is always done outside of event-poll code. Here is rpc_transport_(ref|unref) tracking prior the bug: 1367153818.290418 rpcsvc_request_init 0x7f7ff73b6000 1 -> 2 1367153818.290426 server_alloc_frame 0x7f7ff73b6000 2 -> 3 1367153818.290449 rpcsvc_request_init 0x7f7ff73b6000 3 -> 4 1367153818.290455 server_alloc_frame 0x7f7ff73b6000 4 -> 5 1367153818.291074 rpcsvc_request_init 0x7f7ff73b6000 5 -> 6 1367153818.291076 server_alloc_frame 0x7f7ff73b6000 6 -> 7 1367153818.291100 rpcsvc_request_init 0x7f7ff73b6000 7 -> 8 1367153818.291103 server_alloc_frame 0x7f7ff73b6000 8 -> 9 1367153818.291104 rpcsvc_request_destroy 0x7f7ff73b6000 9 -> 8 1367153818.291104 free_state 0x7f7ff73b6000 8 -> 7 1367153818.291112 rpcsvc_request_init 0x7f7ff73b6000 7 -> 8 1367153818.291114 rpcsvc_request_destroy 0x7f7ff73b6000 8 -> 7 1367153818.291115 server_alloc_frame 0x7f7ff73b6000 7 -> 6 1367153818.291127 rpcsvc_request_destroy 0x7f7ff73b6000 6 -> 5 1367153818.291127 free_state 0x7f7ff73b6000 5 -> 4 1367153818.291987 rpcsvc_request_destroy 0x7f7ff73b6000 4 -> 3 1367153818.291987 free_state 0x7f7ff73b6000 3 -> 2 1367153818.293152 rpcsvc_request_destroy 0x7f7ff73b6000 2 -> 1 1367153818.293152 free_state 0x7f7ff73b6000 1 -> 0 This explains the reference dropping to 0, but it is odd, I just don't get it, it makes no sense: 1367153818.291115 server_alloc_frame 0x7f7ff73b6000 7 -> 6 I double-checked my debug code and it seems correct. server_alloc_frame() contains that call where I added the __func__ argument for debugging: state->xprt = rpc_transport_ref (req->trans, __func__); Here is my modified rpc_transport_ref(): rpc_transport_t * rpc_transport_ref (rpc_transport_t *this, char *func) { rpc_transport_t *return_this = NULL; GF_VALIDATE_OR_GOTO("rpc_transport", this, fail); pthread_mutex_lock (&this->lock); { this->debug[this->debugcount].before = this->refcount; this->refcount ++; this->debug[this->debugcount].after = this->refcount; gettimeofday(&this->debug[this->debugcount].tv, NULL); this->debug[this->debugcount].func = func; this->debug[this->debugcount].trans = this; this->debugcount = (this->debugcount + 1) % DEBUG_SIZE; } pthread_mutex_unlock (&this->lock); return_this = this; fail: return return_this; } And rpc_transport_destroy(): int32_t rpc_transport_destroy (rpc_transport_t *this) { int32_t ret = -1; GF_VALIDATE_OR_GOTO("rpc_transport", this, fail); { int i, c; for (i = 0; i < DEBUG_SIZE; i++) { c = (i + this->debugcount + 1) % DEBUG_SIZE; gf_log("XXXref", GF_LOG_ERROR, "%d.%06d %s %p %d -> %d", this->debug[c].tv.tv_sec, this->debug[c].tv.tv_usec, this->debug[c].func, this->debug[c].trans, this->debug[c].before, this->debug[c].after); } } /* ... */ -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu@xxxxxxxxxx