Hi cephers,

I am testing the RDMA module of Ceph. The test environment is as follows:

- Ceph version: 12.1.0
- 6 hosts, each with 12 OSDs

I injected an error into the cluster by hand:
1. Kill all OSD daemons on one host.
2. Restart the OSD daemons killed just now.

The problem is that the OSDs on the other hosts cannot get heartbeat replies from each other and are wrongly marked down by the monitor.

By analysing the logs, I found that the OSDs on the other hosts sent heartbeats to their peers, but the heartbeats could not be sent because there were not enough buffers:

RDMAConnectedSocketImpl operator() no enough buffers in worker 0x7fd839c18d00

The memory buffers in RDMADispatcher are released by the RDMADispatcher::polling() function. But after I killed all OSD daemons on one host and restarted them, the rate of buffer release slowed down, and eventually the number of inflight chunks reached 1023 (the maximum is 1024):

2017-08-15 20:15:42.383778 7fd82641b700 30 RDMAStack post_tx_buffer release 1 chunks, inflight 1023
2017-08-15 20:15:42.384151 7fd82641b700 30 RDMAStack post_tx_buffer release 1 chunks, inflight 1023
2017-08-15 20:15:42.538885 7fd82641b700 30 RDMAStack post_tx_buffer release 1 chunks, inflight 1023

I think the root cause is related to how the memory buffers are released when the error is injected. Do you have any ideas about this?

Looking forward to your response, and thanks in advance.
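For anyone unfamiliar with the mechanism, the exhaustion path can be sketched as a bounded tx-chunk pool: senders take a chunk per send, and the polling thread gives chunks back when completions arrive. This is a minimal illustrative sketch with hypothetical names, not the actual RDMADispatcher/RDMAStack code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bounded pool of tx chunks. In the real stack the cap would
// correspond to the 1024-chunk limit seen in the logs.
class TxChunkPool {
  const size_t max_chunks_;  // hard cap on chunks posted at once
  size_t inflight_ = 0;      // chunks posted but not yet completed
public:
  explicit TxChunkPool(size_t max_chunks) : max_chunks_(max_chunks) {}

  // A sender tries to take a chunk for a heartbeat/message send.
  // Fails when the pool is exhausted ("no enough buffers").
  bool try_post_tx() {
    if (inflight_ >= max_chunks_) return false;
    ++inflight_;
    return true;
  }

  // The polling thread releases n chunks when send completions arrive.
  void release(size_t n) {
    inflight_ = (n > inflight_) ? 0 : inflight_ - n;
  }

  size_t inflight() const { return inflight_; }
};
```

The point of the sketch: if completions stop arriving (or arrive slowly, e.g. because peer QPs on the killed host went away), release() is called too rarely, inflight climbs to the cap, and from then on every try_post_tx() fails, so even small heartbeat sends are starved.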