Which version do you use? I think we have fixed some memory problems on master.

On Mon, Sep 18, 2017 at 2:09 PM, Jin Cai <caijin.laurence@xxxxxxxxx> wrote:
> Hi, cephers
>
> We are testing the RDMA ms type of Ceph.
>
> The OSDs and MONs are always marked down by their peers because
> they do not have enough buffers available in the memory buffer pool
> to reply to the heartbeat ping messages from their peers.
> The log always shows "no enough buffer in worker" even though
> the whole cluster is idle, without any external I/O.
>
> The Ceph configuration related to RDMA is as follows:
> ms_async_rdma_roce_ver = 1
> ms_async_rdma_sl = 5
> ms_async_rdma_dscp = 136
> ms_async_rdma_send_buffers = 1024
> ms_async_rdma_receive_buffers = 1024
>
> Even when we raise ms_async_rdma_send_buffers to 32768, the
> "no enough buffer in worker" log still appears.
>
> After a deeper analysis, we believe the cause is that when a
> RDMAConnectedSocketImpl instance is destructed, its queue pair is
> added to the dead_queue_pair vector container, and the entries of
> dead_queue_pair are deleted in the polling thread.
>
> From the rdmamojo documentation:
> When a QP is destroyed, any outstanding Work Requests, in either the
> Send or Receive Queue, won't be processed anymore by the RDMA device,
> and Work Completions won't be generated for them. It is up to the user
> to clean up all of the resources associated with those Work Requests
> (i.e. memory buffers).
>
> So the problem is this: when a queue pair is destroyed while it still
> has outstanding work requests, the memory buffers occupied by those
> work requests are never returned to the memory buffer pool, because
> work completions will never be generated for them. This is where the
> memory leak happens.
>
> A more elegant way would be, before destroying a queue pair, to move
> it into the error state, wait for the affiliated asynchronous event
> IBV_EVENT_QP_LAST_WQE_REACHED, and only then destroy the queue pair.
>
> Do you have any suggestions or ideas?
> Thanks in advance.
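For reference, the teardown sequence proposed above might look roughly like the following sketch against the libibverbs API. This is an untested illustration, not Ceph code: the function name drain_and_destroy_qp is made up, error handling is trimmed, and it assumes the QP is attached to an SRQ (as Ceph's RDMA stack does), since IBV_EVENT_QP_LAST_WQE_REACHED is only generated for SRQ-attached QPs. It cannot run without an RDMA device.

```c
#include <infiniband/verbs.h>

/* Hypothetical sketch: drain a QP before destroying it, so that all of
 * its outstanding work requests are flushed and their buffers can be
 * returned to the memory buffer pool. */
static int drain_and_destroy_qp(struct ibv_context *ctx, struct ibv_qp *qp)
{
    /* 1. Move the QP into the error state. Outstanding WRs are then
     *    completed with IBV_WC_WR_FLUSH_ERR status on the CQ, which
     *    lets the completion handler reclaim their memory buffers. */
    struct ibv_qp_attr attr = { .qp_state = IBV_QPS_ERR };
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE))
        return -1;

    /* 2. Wait for IBV_EVENT_QP_LAST_WQE_REACHED on this QP, signalling
     *    that the device has consumed the last WQE. Other async events
     *    are acknowledged and skipped here for brevity. */
    struct ibv_async_event ev;
    for (;;) {
        if (ibv_get_async_event(ctx, &ev))
            return -1;
        int done = (ev.event_type == IBV_EVENT_QP_LAST_WQE_REACHED &&
                    ev.element.qp == qp);
        ibv_ack_async_event(&ev);
        if (done)
            break;
    }

    /* 3. Only now destroy the QP: every WR has either completed or been
     *    flushed, so no buffer is left stranded in the pool. */
    return ibv_destroy_qp(qp);
}
```

In a real integration the async-event wait would of course live in an event loop rather than block like this; the point is only the ordering: error state first, last-WQE event second, ibv_destroy_qp last.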