Ryan,

10 (storage) nodes. I did some tests with 1 brick per node, and another round with 4 per node. Each is FDR connected, but all on the same switch. I'd love to hear about your setup, gluster version, OFED stack, etc.

--
Matthew Nicholson
Research Computing Specialist
Harvard FAS Research Computing
matthew_nicholson at harvard.edu


On Wed, Jul 10, 2013 at 4:33 PM, Ryan Aydelott <ryade at mcs.anl.gov> wrote:

> How many nodes make up that volume that you were using for testing?
>
> Over 100 nodes running at QDR/IPoIB using 100 threads, we ran around
> 60GB/s read and somewhere in the 40GB/s range for writes (iirc).
>
> On Jul 10, 2013, at 1:49 PM, Matthew Nicholson
> <matthew_nicholson at harvard.edu> wrote:
>
> Well, first of all, thanks for the responses. The volume WAS failing over
> to tcp just as predicted, though WHY is unclear, as the fabric is known
> working (it has about 28K compute cores on it, all doing heavy MPI
> testing), and the OFED/verbs stack is consistent across all client/storage
> systems (actually, the OS image is identical).
>
> It's quite sad that RDMA isn't going to make 3.4. We put a good deal of
> hope and effort into planning for 3.4 for this storage system, specifically
> for RDMA support (well, with warnings to the team that it wasn't in/tested
> for 3.3 and that all we could do was HOPE it would be in 3.4 in time for
> when we want to go live). We're getting "okay" performance out of IPoIB
> right now, and our bottleneck actually seems to be the fabric
> design/layout, as we're peaking at about 4.2GB/s writing 10TB over 160
> threads to this distributed volume.
>
> When it IS ready and in 3.4.1 (hopefully!), having good docs around it,
> and maybe even a simple printf on the tcp failover, would be huge for us.
>
>
> --
> Matthew Nicholson
> Research Computing Specialist
> Harvard FAS Research Computing
> matthew_nicholson at harvard.edu
>
>
> On Wed, Jul 10, 2013 at 3:18 AM, Justin Clift <jclift at redhat.com> wrote:
>
>> Hi guys,
>>
>> As an FYI, from discussion on gluster-devel IRC yesterday, the RDMA code
>> still isn't in a good enough state for production usage with 3.4.0. :(
>>
>> There are still outstanding bugs with it, and I'm working to make the
>> Gluster Test Framework able to work with RDMA so we can help shake out
>> more of them:
>>
>> http://www.gluster.org/community/documentation/index.php/Using_the_Gluster_Test_Framework
>>
>> Hopefully RDMA will be ready for 3.4.1, but don't hold me to that at
>> this stage. :)
>>
>> Regards and best wishes,
>>
>> Justin Clift
>>
>>
>> On 09/07/2013, at 8:36 PM, Ryan Aydelott wrote:
>> > Matthew,
>> >
>> > Personally, I have experienced this same problem (even with the mount
>> > being something.rdma). Running 3.4beta4, if I mounted a volume via RDMA
>> > that also had TCP configured as a transport option (which obviously you
>> > do, based on the mounts you gave below), then if there is ANY issue with
>> > RDMA not working, the mount will silently fall back to TCP. This problem
>> > is described here: https://bugzilla.redhat.com/show_bug.cgi?id=982757
>> >
>> > The way to test for this behavior is to create a new volume specifying
>> > ONLY RDMA as the transport. If you mount this and your RDMA is broken
>> > for whatever reason, it will simply fail to mount.
>> >
>> > Assuming this test fails, I would then tail the logs for the volume to
>> > get a hint of what's going on. In my case there was an RDMA_CM kernel
>> > module that was not loaded, which started to matter as of 3.4beta2 IIRC,
>> > as they did a complete rewrite of this transport because of poor
>> > performance in prior releases. The clue in my volume log file was
>> > "no such file or directory" preceded by an rdma_cm.
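For reference, a minimal sketch of the RDMA-only check described above, using a hypothetical test volume name and one of the brick hosts from the setup quoted further down; the brick path and client log file name are assumptions and will differ on other systems:

    # Create and start a throwaway volume that can ONLY use RDMA
    gluster volume create rdmaonly transport rdma holyscratch01-ib:/holyscratch01/rdmaonly
    gluster volume start rdmaonly

    # Mount it over RDMA; if RDMA is broken, this should fail outright
    # instead of silently falling back to tcp
    mkdir -p /mnt/rdmaonly
    mount -t glusterfs -o transport=rdma holyscratch01-ib:/rdmaonly /mnt/rdmaonly

    # Look for rdma/rdma_cm errors in the client mount log, and confirm the
    # rdma_cm kernel module is loaded
    grep -i rdma /var/log/glusterfs/mnt-rdmaonly.log
    lsmod | grep rdma_cm || modprobe rdma_cm

If the RDMA-only mount works but the rdma,tcp volume still pushes traffic over tcp, that points back at the silent fallback described in the bug report above.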
>> >
>> > Hope that helps!
>> >
>> > -ryan
>> >
>> >
>> > On Jul 9, 2013, at 2:03 PM, Matthew Nicholson
>> > <matthew_nicholson at harvard.edu> wrote:
>> >
>> >> Hey guys,
>> >>
>> >> So, we're testing Gluster RDMA storage and are having some issues.
>> >> Things are working... just not as we expected them to. There isn't a
>> >> whole lot that I've found in the way of docs for gluster rdma, aside
>> >> from basically "install gluster-rdma", create a volume with
>> >> transport=rdma, and mount with transport=rdma...
>> >>
>> >> I've done that... and the IB fabric is known to be good... however, a
>> >> volume created with transport=rdma,tcp and mounted with transport=rdma
>> >> still seems to go over tcp?
>> >>
>> >> A little more info about the setup:
>> >>
>> >> We've got 10 storage nodes/bricks, each of which has a single 1GbE NIC
>> >> and an FDR IB port. Likewise for the test clients. The 1GbE NIC is for
>> >> management only, and all of the systems on this fabric are configured
>> >> with IPoIB, so there is eth0 and ib0 on each node.
>> >>
>> >> All storage nodes are peered using the ib0 interface, i.e.:
>> >>
>> >> gluster peer probe storage_node01-ib
>> >> etc.
>> >>
>> >> That's all well and good.
>> >>
>> >> The volume was created with:
>> >>
>> >> gluster volume create holyscratch transport rdma,tcp holyscratch01-ib:/holyscratch01/brick
>> >> for i in `seq -w 2 10` ; do gluster volume add-brick holyscratch holyscratch${i}-ib:/holyscratch${i}/brick; done
>> >>
>> >> yielding:
>> >>
>> >> Volume Name: holyscratch
>> >> Type: Distribute
>> >> Volume ID: 788e74dc-6ae2-4aa5-8252-2f30262f0141
>> >> Status: Started
>> >> Number of Bricks: 10
>> >> Transport-type: tcp,rdma
>> >> Bricks:
>> >> Brick1: holyscratch01-ib:/holyscratch01/brick
>> >> Brick2: holyscratch02-ib:/holyscratch02/brick
>> >> Brick3: holyscratch03-ib:/holyscratch03/brick
>> >> Brick4: holyscratch04-ib:/holyscratch04/brick
>> >> Brick5: holyscratch05-ib:/holyscratch05/brick
>> >> Brick6: holyscratch06-ib:/holyscratch06/brick
>> >> Brick7: holyscratch07-ib:/holyscratch07/brick
>> >> Brick8: holyscratch08-ib:/holyscratch08/brick
>> >> Brick9: holyscratch09-ib:/holyscratch09/brick
>> >> Brick10: holyscratch10-ib:/holyscratch10/brick
>> >> Options Reconfigured:
>> >> nfs.disable: on
>> >>
>> >> For testing, we wanted to see how rdma stacked up against tcp over
>> >> IPoIB, so we mounted it like this:
>> >>
>> >> [root at holy2a01202 holyscratch.tcp]# df -h | grep holyscratch
>> >> holyscratch:/holyscratch
>> >>                   273T  4.1T  269T   2% /n/holyscratch.tcp
>> >> holyscratch:/holyscratch.rdma
>> >>                   273T  4.1T  269T   2% /n/holyscratch.rdma
>> >>
>> >> So, 2 mounts, same volume, different transports. fstab looks like:
>> >>
>> >> holyscratch:/holyscratch /n/holyscratch.tcp glusterfs transport=tcp,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >> holyscratch:/holyscratch /n/holyscratch.rdma glusterfs transport=rdma,fetch-attempts=10,gid-timeout=2,acl,_netdev 0 0
>> >>
>> >> where holyscratch is a round-robin DNS entry across all the IPoIB
>> >> interfaces, used for fetching the volfile (something that, just like
>> >> peering, it seems MUST be tcp?)
>> >>
>> >> But, again, when running just dumb, dumb, dumb tests (160 threads of dd
>> >> over 8 nodes, with each thread writing 64GB, so a 10TB throughput test),
>> >> I'm seeing all the traffic on the IPoIB interface for both the RDMA and
>> >> TCP transports... when I really shouldn't be seeing ANY tcp traffic on
>> >> the IPoIB interface when using RDMA as a transport, aside from volfile
>> >> fetches/management... right? As a result, in early tests (the bigger
>> >> 10TB ones are running now), the tcp and rdma speeds were basically the
>> >> same... when I would expect the RDMA one to be at least slightly faster.
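One way to confirm which transport the data path is actually using during those dd runs is to compare the IPoIB netdev counters with the HCA port counters: TCP over IPoIB shows up in the ib0 byte counters, while native RDMA bypasses the netdev and only appears in the InfiniBand port counters. A rough sketch; the HCA name mlx4_0 is an assumption and may differ:

    # IPoIB netdev counters: should stay nearly flat during an RDMA-transport
    # run, apart from volfile fetches and management traffic
    cat /sys/class/net/ib0/statistics/tx_bytes
    cat /sys/class/net/ib0/statistics/rx_bytes

    # HCA port counters, which also include native RDMA traffic
    # (values are reported in units of 4 octets)
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
    cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data

Sampling these before and after a dd run and comparing the deltas should show whether the rdma mount is really bypassing IPoIB.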
>> >>
>> >> Oh, and this is all 3.4beta4, on both the clients and the storage nodes.
>> >>
>> >> So, I guess my questions are:
>> >>
>> >> Is this expected/normal?
>> >> Is peering/volfile fetching always tcp based?
>> >> How should one peer nodes in an RDMA setup?
>> >> Should this be tried with only RDMA as a transport on the volume?
>> >> Are there more detailed docs for RDMA gluster coming with the 3.4 release?
>> >>
>> >> --
>> >> Matthew Nicholson
>> >> Research Computing Specialist
>> >> Harvard FAS Research Computing
>> >> matthew_nicholson at harvard.edu
>> >>
>> >> _______________________________________________
>> >> Gluster-users mailing list
>> >> Gluster-users at gluster.org
>> >> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>> >
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>> --
>> Open Source and Standards @ Red Hat
>>
>> twitter.com/realjustinclift