Re: State of play for RDMA on Luminous

On Mon, Aug 28, 2017 at 7:54 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> On Mon, Aug 28, 2017 at 4:21 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
>> On Wed, Aug 23, 2017 at 1:26 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>> Hello everyone,
>>>
>>> I'm trying to get a handle on the current state of the async messenger's
>>> RDMA transport in Luminous, and I've noticed that the information
>>> available is a little bit sparse (I've found
>>> https://community.mellanox.com/docs/DOC-2693 and
>>> https://community.mellanox.com/docs/DOC-2721, which are a great start
>>> but don't look very complete). So I'm kicking off this thread that might
>>> hopefully bring interested parties and developers together.
>>>
>>> Could someone in the know please confirm that the following assumptions
>>> of mine are accurate:
>>>
>>> - RDMA support for the async messenger is available in Luminous.
>>
>> To be precise, RDMA in Luminous is available but lacks memory
>> control when under pressure. It should be fine to run for test purposes.
>
> OK, thanks! Assuming async+rdma will become fully supported some time
> in the next release or two, are there plans to backport async+rdma
> related features to Luminous? Or will users likely need to wait for
> the next release to get a production-grade Ceph/RDMA stack?

I think so

>
>>> - You enable it globally by setting ms_type to "async+rdma", and by
>>> setting appropriate values for the various ms_async_rdma* options (most
>>> importantly, ms_async_rdma_device_name).
>>>
>>> - You can also set RDMA messaging just for the public or cluster
>>> network, via ms_public_type and ms_cluster_type.
>>>
>>> - Users have to make a global async+rdma vs. async+posix decision on
>>> either network. For example, if either ms_type or ms_public_type is
>>> configured to async+rdma on cluster nodes, then a client configured with
>>> ms_type = async+posix can't communicate.
>>>
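For anyone following along, here is a minimal ceph.conf sketch of the settings described in the assumptions above. The device name is a placeholder, and the commented lines show the per-network alternative; verify the exact option names against your release and the Mellanox guides linked earlier.

    [global]
    # Use the RDMA-enabled async messenger for all traffic
    ms_type = async+rdma
    # RDMA device, e.g. as listed by ibv_devices (placeholder name)
    ms_async_rdma_device_name = mlx5_0

    # Or split the transports per network instead of setting ms_type:
    #ms_public_type = async+posix
    #ms_cluster_type = async+rdma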
>>> Based on those assumptions, I have the following questions:
>>>
>>> - What is the current state of RDMA support in kernel libceph? In other
>>> words, is there currently a way to map RBDs, or mount CephFS, if a Ceph
>>> cluster uses RDMA messaging?
>>
>> No plans on the kernel side so far. rbd-nbd and CephFS FUSE (ceph-fuse) should be supported now.
>
> Understood — are there plans to support async+rdma in the kernel at
> all, or is there something in the kernel that precludes this?

no.
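For the userspace paths mentioned above, a rough sketch of how they would be used on an RDMA-enabled cluster. The pool/image name and the mount point are made up; both tools read ms_type and the ms_async_rdma* options from ceph.conf, so nothing RDMA-specific appears on the command line.

    # Map an RBD image through the userspace NBD bridge instead of krbd
    rbd-nbd map rbd/testimage        # prints the nbd device, e.g. /dev/nbd0

    # Mount CephFS through FUSE instead of the kernel client
    mkdir -p /mnt/cephfs
    ceph-fuse /mnt/cephfs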

>
>>> - In case there is no such support in the kernel yet: What's the current
>>> status of RDMA support (and testing) with regard to
>>>   * libcephfs?
>>
>> libcephfs should be OK, but the MDS has some potential problems that
>> haven't been verified recently, because it uses some different and
>> tricky messenger methods. I'm not sure whether the issue still exists.
>>
>>>   * the Samba Ceph VFS?
>>
>> no testing
>>
>>>   * nfs-ganesha?
>>
>> no testing
>>
>>>   * tcmu-runner?
>>
>> I have received another user report that tcmu-runner has a conflict
>> with the ibverbs dependencies (netlink library version).
>>
>>>
>>> - In summary, if a user wants to access their Ceph cluster via a POSIX
>>> filesystem or via iSCSI, is enabling the RDMA-enabled async messenger in
>>> the public network an option? Or would they have to continue running on
>>> TCP/IP (possibly on IPoIB if they already have InfiniBand hardware)
>>> until the client libraries catch up?
>>
>> Any attempt is welcome.
>
> OK. But for now, would you agree that *production* systems with IB
> HCAs should use IPoIB, and async+posix?

At this time, IPoIB is preferred for production use.
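In other words, a minimal sketch of that conservative setup for IB hardware, with placeholder subnets standing in for the IPoIB interfaces (e.g. ib0):

    [global]
    # Stay on the default TCP/IP messenger
    ms_type = async+posix
    # ...but run it over the IPoIB networks (placeholder subnets)
    public_network = 192.168.100.0/24
    cluster_network = 192.168.101.0/24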

>
>>> - And more broadly, if a user wants to use the performance benefits of
>>> RDMA, but not all of their potential Ceph clients have InfiniBand HCAs,
>>> what are their options? RoCE?
>>
>> RoCE v2 is supported.
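For reference, a sketch of what a RoCE v2 setup might look like, loosely following the Mellanox DOC-2721 guide linked at the start of the thread. The device name and GID are placeholders, and the option names beyond ms_async_rdma_device_name should be treated as assumptions to verify against your Ceph build:

    [global]
    ms_cluster_type = async+rdma
    # RoCE-capable NIC as listed by ibv_devices (placeholder name)
    ms_async_rdma_device_name = mlx5_0
    # GID matching the RoCE v2 address of that port (placeholder value;
    # see the Mellanox guide for how to look it up)
    ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:640a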
>
> Thanks!
>
> Cheers,
> Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com