Re: Ceph RDMA Memory Leakage

Which version are you using? I think we have fixed some memory problems on master.

On Mon, Sep 18, 2017 at 2:09 PM, Jin Cai <caijin.laurence@xxxxxxxxx> wrote:
> Hi, cephers
>
>     We are testing the RDMA ms type of Ceph.
>
>     The OSDs and MONs keep getting marked down by their peers because
> they run out of buffers in the memory buffer pool and cannot reply to
> the heartbeat ping messages from their peers.
>     The log repeatedly shows "no enough buffer in worker" even though
> the whole cluster is idle, with no external I/O.
>
>     Our RDMA-related Ceph configuration is as follows:
>         ms_async_rdma_roce_ver = 1
>         ms_async_rdma_sl = 5
>         ms_async_rdma_dscp = 136
>         ms_async_rdma_send_buffers = 1024
>         ms_async_rdma_receive_buffers = 1024
>
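> A minimal sketch of how these options sit in ceph.conf, assuming the
> standard async+rdma messenger type (the device name below is only a
> placeholder and depends on the HCA):
>
>     [global]
>     ms_type = async+rdma
>     ms_async_rdma_device_name = mlx5_0   # placeholder
>     ms_async_rdma_roce_ver = 1
>     ms_async_rdma_sl = 5
>     ms_async_rdma_dscp = 136
>     ms_async_rdma_send_buffers = 1024
>     ms_async_rdma_receive_buffers = 1024
>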
>    Even when we increase ms_async_rdma_send_buffers to 32,768, the
> "no enough buffer in worker" messages still appear.
>
>    After a deeper analysis, we believe this happens because when an
> RDMAConnectedSocketImpl instance is destructed, its queue pair is
> appended to the dead_queue_pair vector, and the entries of
> dead_queue_pair are only destroyed later by the polling thread.
>
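> In other words, roughly the following pattern (a condensed, hypothetical
> sketch of what we described, not the actual Ceph source; defer_qp_destroy
> and reap_dead_qps are illustrative names):
>
>     #include <infiniband/verbs.h>
>     #include <mutex>
>     #include <vector>
>
>     static std::vector<ibv_qp*> dead_queue_pairs;  // QPs awaiting deletion
>     static std::mutex qp_lock;
>
>     // Destructor side: the QP is not destroyed inline, it is handed
>     // over to the polling thread for later deletion.
>     void defer_qp_destroy(ibv_qp* qp) {
>       std::lock_guard<std::mutex> l(qp_lock);
>       dead_queue_pairs.push_back(qp);
>     }
>
>     // Polling-thread side: the deferred QPs are destroyed here. Work
>     // requests still outstanding on them never generate completions,
>     // so their buffers never go back to the pool.
>     void reap_dead_qps() {
>       std::lock_guard<std::mutex> l(qp_lock);
>       for (ibv_qp* qp : dead_queue_pairs)
>         ibv_destroy_qp(qp);
>       dead_queue_pairs.clear();
>     }
>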
> From the rdmamojo documentation:
> When a QP is destroyed any outstanding Work Requests, in either the
> Send or Receive Queues, won't be processed anymore by the RDMA device
> and Work Completions won't be generated for them. It is up to the user
> to clean all of the associated resources of those Work Requests (i.e.
> memory buffers)
>
> The problem, then, is that when a queue pair is destroyed while it
> still has outstanding work requests, the memory buffers occupied by
> those work requests are never returned to the memory buffer pool,
> because no work completions will ever be generated for them. That is
> how the memory leak happens.
>
> A more robust approach would be to first move the queue pair into the
> error state, wait for the affiliated IBV_EVENT_QP_LAST_WQE_REACHED
> asynchronous event, and only then destroy the queue pair (see the
> sketch below).
>
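> Roughly what we have in mind, as a minimal sketch against the plain
> libibverbs API (error handling omitted; note that
> IBV_EVENT_QP_LAST_WQE_REACHED is only generated for QPs attached to an
> SRQ):
>
>     // Proposed teardown sequence (sketch, not a drop-in patch).
>     void drain_and_destroy_qp(ibv_context* ctx, ibv_qp* qp) {
>       // 1. Move the QP to the error state so outstanding work requests
>       //    complete (in flush error) instead of silently disappearing.
>       ibv_qp_attr attr = {};
>       attr.qp_state = IBV_QPS_ERR;
>       ibv_modify_qp(qp, &attr, IBV_QP_STATE);
>
>       // 2. Wait for the LAST_WQE_REACHED async event for this QP.
>       ibv_async_event ev;
>       while (ibv_get_async_event(ctx, &ev) == 0) {
>         bool done = (ev.event_type == IBV_EVENT_QP_LAST_WQE_REACHED &&
>                      ev.element.qp == qp);
>         ibv_ack_async_event(&ev);
>         if (done)
>           break;
>       }
>
>       // 3. The flushed completions can now be reaped and their buffers
>       //    returned to the pool before the QP is finally destroyed.
>       ibv_destroy_qp(qp);
>     }
>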
> Do you have any suggestions or ideas? Thanks in advance.


