Re: State of play for RDMA on Luminous

On Mon, Aug 28, 2017 at 4:21 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> On Wed, Aug 23, 2017 at 1:26 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>> Hello everyone,
>>
>> I'm trying to get a handle on the current state of the async messenger's
>> RDMA transport in Luminous, and I've noticed that the information
>> available is a little bit sparse (I've found
>> https://community.mellanox.com/docs/DOC-2693 and
>> https://community.mellanox.com/docs/DOC-2721, which are a great start
>> but don't look very complete). So I'm kicking off this thread in the hope
>> of bringing interested parties and developers together.
>>
>> Could someone in the know please confirm that the following assumptions
>> of mine are accurate:
>>
>> - RDMA support for the async messenger is available in Luminous.
>
> to be precise, rdma in luminous is available but lacks memory
> control when under pressure. it would be ok to run for test purposes.

OK, thanks! Assuming async+rdma will become fully supported some time
in the next release or two, are there plans to backport async+rdma
related features to Luminous? Or will users likely need to wait for
the next release to get a production-grade Ceph/RDMA stack?

>> - You enable it globally by setting ms_type to "async+rdma", and by
>> setting appropriate values for the various ms_async_rdma* options (most
>> importantly, ms_async_rdma_device_name).
>>
>> - You can also set RDMA messaging just for the public or cluster
>> network, via ms_public_type and ms_cluster_type.
>>
>> - Users have to make a global async+rdma vs. async+posix decision on
>> either network. For example, if either ms_type or ms_public_type is
>> configured to async+rdma on cluster nodes, then a client configured with
>> ms_type = async+posix can't communicate.
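
To make those assumptions concrete, here is roughly the ceph.conf sketch I
have in mind, sticking to the options already mentioned in this thread. The
device name is only an illustrative placeholder of my own, so corrections
are welcome:

  [global]
  # Switch both public and cluster messaging to the RDMA-enabled async
  # messenger; alternatively, set ms_public_type / ms_cluster_type to
  # split the decision per network.
  ms_type = async+rdma
  # RDMA device to bind to. "mlx5_0" is just an example device name;
  # check the output of "ibv_devices" on each node for the real one.
  ms_async_rdma_device_name = mlx5_0

If that reading is correct, every daemon and client talking over a given
network has to agree on the transport, which is exactly why I'm asking the
client-side questions below.
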
>>
>> Based on those assumptions, I have the following questions:
>>
>> - What is the current state of RDMA support in kernel libceph? In other
>> words, is there currently a way to map RBDs, or mount CephFS, if a Ceph
>> cluster uses RDMA messaging?
>
> there is no planning on the kernel side so far. rbd-nbd and ceph-fuse should be supported now.

Understood — are there plans to support async+rdma in the kernel at
all, or is there something in the kernel that precludes this?
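
(For anyone following along: pending kernel support, the userspace clients
seem to be the way to try this today. A minimal sketch, with the pool name,
image name and mountpoint being placeholders of my own:

  # map an RBD image through the userspace librbd/NBD bridge
  rbd-nbd map rbd/testimage
  # mount CephFS through the FUSE client
  ceph-fuse /mnt/cephfs

Both run entirely in userspace on top of librados, so they should pick up
whatever messenger type the client-side ceph.conf selects.)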

>> - In case there is no such support in the kernel yet: What's the current
>> status of RDMA support (and testing) with regard to
>>   * libcephfs?
>
> libcephfs should be ok, but the mds has some potential problems that haven't
> been verified recently, because it uses some different and tricky messenger
> methods. I'm not sure whether they still exist.
>
>>   * the Samba Ceph VFS?
>
> no testing
>
>>   * nfs-ganesha?
>
> no testing
>
>>   * tcmu-runner?
>
> I have received another user report that tcmu-runner has a conflict
> with the ibverbs deps (netlink library version)
>
>>
>> - In summary, if a user wants to access their Ceph cluster via a POSIX
>> filesystem or via iSCSI, is enabling the RDMA-enabled async messenger in
>> the public network an option? Or would they have to continue running on
>> TCP/IP (possibly on IPoIB if they already have InfiniBand hardware)
>> until the client libraries catch up?
>
> any testing is welcome.

OK. But for now, would you agree that *production* systems with IB
HCAs should use IPoIB, and async+posix?
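
In config terms, I take that to mean something like the sketch below; the
subnets are of course just example ranges standing in for whatever is
configured on the IPoIB interfaces:

  [global]
  # Stay on the TCP/IP-based async messenger for now.
  ms_type = async+posix
  # Run the traffic over IPoIB by pointing the public/cluster networks
  # at the subnets assigned to the IPoIB interfaces (example ranges).
  public_network  = 192.168.100.0/24
  cluster_network = 192.168.101.0/24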

>> - And more broadly, if a user wants to use the performance benefits of
>> RDMA, but not all of their potential Ceph clients have InfiniBand HCAs,
>> what are their options? RoCE?
>
> roce v2 is supported

Thanks!

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com