Jeremy -
   What version of OFED are you running? Would you mind installing
version 1.5.2 from source? We have seen this resolve several issues of
this type.
http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/

Thanks,

Craig

--
Craig Carl
Senior Systems Engineer
Gluster

On 12/02/2010 10:05 AM, Jeremy Stout wrote:
> As another follow-up, I tested several compilations today with
> different values for send/receive count. I found the maximum value I
> could use for both variables was 127. With a value of 127, GlusterFS
> did not produce any errors. However, when I changed the value back to
> 128, the RDMA errors appeared again.
>
> I also tried setting the soft/hard "memlock" limits to unlimited in the
> limits.conf file, but still ran into RDMA errors on the client side
> when the count variables were set to 128.
>
> On Thu, Dec 2, 2010 at 9:04 AM, Jeremy Stout <stout.jeremy at gmail.com> wrote:
>> Thank you for the response. I've been testing GlusterFS 3.1.1 on two
>> different OpenSUSE 11.3 systems. Since both systems generated the same
>> error messages, I'll include the output for both.
>>
>> System #1:
>> fs-1:~ # cat /proc/meminfo
>> MemTotal: 16468756 kB
>> MemFree: 16126680 kB
>> Buffers: 15680 kB
>> Cached: 155860 kB
>> SwapCached: 0 kB
>> Active: 65228 kB
>> Inactive: 123100 kB
>> Active(anon): 18632 kB
>> Inactive(anon): 48 kB
>> Active(file): 46596 kB
>> Inactive(file): 123052 kB
>> Unevictable: 1988 kB
>> Mlocked: 1988 kB
>> SwapTotal: 0 kB
>> SwapFree: 0 kB
>> Dirty: 30072 kB
>> Writeback: 4 kB
>> AnonPages: 18780 kB
>> Mapped: 12136 kB
>> Shmem: 220 kB
>> Slab: 39592 kB
>> SReclaimable: 13108 kB
>> SUnreclaim: 26484 kB
>> KernelStack: 2360 kB
>> PageTables: 2036 kB
>> NFS_Unstable: 0 kB
>> Bounce: 0 kB
>> WritebackTmp: 0 kB
>> CommitLimit: 8234376 kB
>> Committed_AS: 107304 kB
>> VmallocTotal: 34359738367 kB
>> VmallocUsed: 314316 kB
>> VmallocChunk: 34349860776 kB
>> HardwareCorrupted: 0 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 2048 kB
>> DirectMap4k: 9856 kB
>> DirectMap2M: 3135488 kB
>> DirectMap1G: 13631488 kB
>>
>> fs-1:~ # uname -a
>> Linux fs-1 2.6.32.25-November2010 #2 SMP PREEMPT Mon Nov 1 15:19:55
>> EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>>
>> fs-1:~ # ulimit -l
>> 64
>>
>> System #2:
>> submit-1:~ # cat /proc/meminfo
>> MemTotal: 16470424 kB
>> MemFree: 16197292 kB
>> Buffers: 11788 kB
>> Cached: 85492 kB
>> SwapCached: 0 kB
>> Active: 39120 kB
>> Inactive: 76548 kB
>> Active(anon): 18532 kB
>> Inactive(anon): 48 kB
>> Active(file): 20588 kB
>> Inactive(file): 76500 kB
>> Unevictable: 0 kB
>> Mlocked: 0 kB
>> SwapTotal: 67100656 kB
>> SwapFree: 67100656 kB
>> Dirty: 24 kB
>> Writeback: 0 kB
>> AnonPages: 18408 kB
>> Mapped: 11644 kB
>> Shmem: 184 kB
>> Slab: 34000 kB
>> SReclaimable: 8512 kB
>> SUnreclaim: 25488 kB
>> KernelStack: 2160 kB
>> PageTables: 1952 kB
>> NFS_Unstable: 0 kB
>> Bounce: 0 kB
>> WritebackTmp: 0 kB
>> CommitLimit: 75335868 kB
>> Committed_AS: 105620 kB
>> VmallocTotal: 34359738367 kB
>> VmallocUsed: 76416 kB
>> VmallocChunk: 34359652640 kB
>> HardwareCorrupted: 0 kB
>> HugePages_Total: 0
>> HugePages_Free: 0
>> HugePages_Rsvd: 0
>> HugePages_Surp: 0
>> Hugepagesize: 2048 kB
>> DirectMap4k: 7488 kB
>> DirectMap2M: 16769024 kB
>>
>> submit-1:~ # uname -a
>> Linux submit-1 2.6.33.7-November2010 #1 SMP PREEMPT Mon Nov 8 13:49:00
>> EST 2010 x86_64 x86_64 x86_64 GNU/Linux
>>
>> submit-1:~ # ulimit -l
>> 64
>>
>> I retrieved the memory information on each machine after starting the
>> glusterd process.
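Both systems above report `ulimit -l` as 64, i.e. a 64 kB locked-memory limit, while RDMA transports have to register (and therefore pin) much larger buffers. The snippet below is a minimal sketch, not part of GlusterFS, showing how a process can inspect RLIMIT_MEMLOCK and raise its soft limit; the 128 MiB target is an illustrative assumption, not a value taken from this thread.

/* Sketch: inspect and (where permitted) raise RLIMIT_MEMLOCK before
 * opening an RDMA device. Build with: cc memlock_check.c -o memlock_check
 * The 128 MiB target is an illustrative assumption. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        fprintf(stderr, "getrlimit: %s\n", strerror(errno));
        return 1;
    }
    printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Raise the soft limit toward 128 MiB, capped at the hard limit.
     * Raising the hard limit itself requires root or a higher value
     * configured in /etc/security/limits.conf. */
    rlim_t want = 128ULL * 1024 * 1024;
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;
    rl.rlim_cur = want;

    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        fprintf(stderr, "setrlimit: %s\n", strerror(errno));
    else
        printf("soft limit now %llu bytes\n", (unsigned long long)rl.rlim_cur);

    return 0;
}

With a 64 kB hard limit, as shown above, this can only confirm the ceiling; the hard limit still has to be raised through a root-owned setting such as the "memlock" entries in /etc/security/limits.conf that Jeremy mentions.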
>>
>> On Thu, Dec 2, 2010 at 3:51 AM, Raghavendra G <raghavendra at gluster.com> wrote:
>>> Hi Jeremy,
>>>
>>> can you also get the output of,
>>>
>>> #uname -a
>>>
>>> #ulimit -l
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Raghavendra G" <raghavendra at gluster.com>
>>> To: "Jeremy Stout" <stout.jeremy at gmail.com>
>>> Cc: gluster-users at gluster.org
>>> Sent: Thursday, December 2, 2010 10:20:04 AM
>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>
>>> Hi Jeremy,
>>>
>>> In order to diagnose why completion queue creation is failing (as
>>> indicated by the logs), we want to know how much free memory was
>>> available on your system when glusterfs was started.
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Raghavendra G" <raghavendra at gluster.com>
>>> To: "Jeremy Stout" <stout.jeremy at gmail.com>
>>> Cc: gluster-users at gluster.org
>>> Sent: Thursday, December 2, 2010 10:11:18 AM
>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>
>>> Hi Jeremy,
>>>
>>> Yes, there might be some performance decrease, but it should not
>>> affect the working of rdma.
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Jeremy Stout" <stout.jeremy at gmail.com>
>>> To: gluster-users at gluster.org
>>> Sent: Thursday, December 2, 2010 8:30:20 AM
>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>
>>> As an update to my situation, I think I have GlusterFS 3.1.1 working
>>> now. I was able to create and mount RDMA volumes without any errors.
>>>
>>> To fix the problem, I had to make the following changes on lines 3562
>>> and 3563 in rdma.c:
>>> options->send_count = 32;
>>> options->recv_count = 32;
>>>
>>> The values were previously set to 128.
>>>
>>> I'll run some tests tomorrow to verify that it is working correctly.
>>> Assuming it does, what would be the expected side-effect of changing
>>> the values from 128 to 32? Will there be a decrease in performance?
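The error reported later in this thread is "rdma_create_cq ... creation of send_cq failed", and the workaround above shrinks send_count/recv_count from 128 to 32. A plausible reading, though not confirmed anywhere in this thread, is that the completion queue the transport requests grows with these counts and ends up larger than the mthca device (or the locked-memory limit) will allow. The sketch below is not the GlusterFS rdma.c code; it only illustrates, under that assumption, how a CQ request can be compared against the device's advertised max_cqe with libibverbs. The "peers" multiplier is invented for the example.

/* Illustrative sketch only -- not GlusterFS code. Probes whether a CQ of
 * a given size can be created on the first RDMA device.
 * Build with: cc cq_probe.c -o cq_probe -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }

    struct ibv_device_attr attr;
    if (ibv_query_device(ctx, &attr)) {
        fprintf(stderr, "ibv_query_device failed\n");
        return 1;
    }

    /* Hypothetical sizing: one CQ entry per outstanding buffer, times an
     * assumed number of concurrent connections. 128 is the value that
     * failed for Jeremy; the multiplier is a made-up stand-in. */
    int send_count = 128;
    int peers      = 16;
    int wanted_cqe = send_count * peers;

    printf("%s: max_cqe=%d, requesting %d\n",
           ibv_get_device_name(devs[0]), attr.max_cqe, wanted_cqe);

    struct ibv_cq *cq = ibv_create_cq(ctx, wanted_cqe, NULL, NULL, 0);
    if (!cq)
        fprintf(stderr, "ibv_create_cq(%d) failed (compare against max_cqe "
                        "and RLIMIT_MEMLOCK)\n", wanted_cqe);
    else
        ibv_destroy_cq(cq);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

If a probe like this fails well below max_cqe, locked memory is the more likely limit; if it only fails near max_cqe, the 127-versus-128 boundary Jeremy observed may simply be where the computed CQ size crosses the device's ceiling.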
>>>
>>>
>>> On Wed, Dec 1, 2010 at 10:07 AM, Jeremy Stout <stout.jeremy at gmail.com> wrote:
>>>> Here are the results of the test:
>>>> submit-1:/usr/local/glusterfs/3.1.1/var/log/glusterfs # ibv_srq_pingpong
>>>> local address: LID 0x0002, QPN 0x000406, PSN 0x703b96, GID ::
>>>> local address: LID 0x0002, QPN 0x000407, PSN 0x618cc8, GID ::
>>>> local address: LID 0x0002, QPN 0x000408, PSN 0xd62272, GID ::
>>>> local address: LID 0x0002, QPN 0x000409, PSN 0x5db5d9, GID ::
>>>> local address: LID 0x0002, QPN 0x00040a, PSN 0xc51978, GID ::
>>>> local address: LID 0x0002, QPN 0x00040b, PSN 0x05fd7a, GID ::
>>>> local address: LID 0x0002, QPN 0x00040c, PSN 0xaa4a51, GID ::
>>>> local address: LID 0x0002, QPN 0x00040d, PSN 0xb7a676, GID ::
>>>> local address: LID 0x0002, QPN 0x00040e, PSN 0x56bde2, GID ::
>>>> local address: LID 0x0002, QPN 0x00040f, PSN 0xa662bc, GID ::
>>>> local address: LID 0x0002, QPN 0x000410, PSN 0xee27b0, GID ::
>>>> local address: LID 0x0002, QPN 0x000411, PSN 0x89c683, GID ::
>>>> local address: LID 0x0002, QPN 0x000412, PSN 0xd025b3, GID ::
>>>> local address: LID 0x0002, QPN 0x000413, PSN 0xcec8e4, GID ::
>>>> local address: LID 0x0002, QPN 0x000414, PSN 0x37e5d2, GID ::
>>>> local address: LID 0x0002, QPN 0x000415, PSN 0x29562e, GID ::
>>>> remote address: LID 0x000b, QPN 0x000406, PSN 0x3b644e, GID ::
>>>> remote address: LID 0x000b, QPN 0x000407, PSN 0x173320, GID ::
>>>> remote address: LID 0x000b, QPN 0x000408, PSN 0xc105ea, GID ::
>>>> remote address: LID 0x000b, QPN 0x000409, PSN 0x5e5ff1, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040a, PSN 0xff15b0, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040b, PSN 0xf0b152, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040c, PSN 0x4ced49, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040d, PSN 0x01da0e, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040e, PSN 0x69459a, GID ::
>>>> remote address: LID 0x000b, QPN 0x00040f, PSN 0x197c14, GID ::
>>>> remote address: LID 0x000b, QPN 0x000410, PSN 0xd50228, GID ::
>>>> remote address: LID 0x000b, QPN 0x000411, PSN 0xbc9b9b, GID ::
>>>> remote address: LID 0x000b, QPN 0x000412, PSN 0x0870eb, GID ::
>>>> remote address: LID 0x000b, QPN 0x000413, PSN 0xfb1fbc, GID ::
>>>> remote address: LID 0x000b, QPN 0x000414, PSN 0x3eefca, GID ::
>>>> remote address: LID 0x000b, QPN 0x000415, PSN 0xbd64c6, GID ::
>>>> 8192000 bytes in 0.01 seconds = 5917.47 Mbit/sec
>>>> 1000 iters in 0.01 seconds = 11.07 usec/iter
>>>>
>>>> fs-1:/usr/local/glusterfs/3.1.1/var/log/glusterfs # ibv_srq_pingpong submit-1
>>>> local address: LID 0x000b, QPN 0x000406, PSN 0x3b644e, GID ::
>>>> local address: LID 0x000b, QPN 0x000407, PSN 0x173320, GID ::
>>>> local address: LID 0x000b, QPN 0x000408, PSN 0xc105ea, GID ::
>>>> local address: LID 0x000b, QPN 0x000409, PSN 0x5e5ff1, GID ::
>>>> local address: LID 0x000b, QPN 0x00040a, PSN 0xff15b0, GID ::
>>>> local address: LID 0x000b, QPN 0x00040b, PSN 0xf0b152, GID ::
>>>> local address: LID 0x000b, QPN 0x00040c, PSN 0x4ced49, GID ::
>>>> local address: LID 0x000b, QPN 0x00040d, PSN 0x01da0e, GID ::
>>>> local address: LID 0x000b, QPN 0x00040e, PSN 0x69459a, GID ::
>>>> local address: LID 0x000b, QPN 0x00040f, PSN 0x197c14, GID ::
>>>> local address: LID 0x000b, QPN 0x000410, PSN 0xd50228, GID ::
>>>> local address: LID 0x000b, QPN 0x000411, PSN 0xbc9b9b, GID ::
>>>> local address: LID 0x000b, QPN 0x000412, PSN 0x0870eb, GID ::
>>>> local address: LID 0x000b, QPN 0x000413, PSN 0xfb1fbc, GID ::
>>>> local address: LID 0x000b, QPN 0x000414, PSN 0x3eefca, GID ::
>>>> local address: LID 0x000b, QPN 0x000415, PSN 0xbd64c6, GID ::
>>>> remote address: LID 0x0002, QPN 0x000406, PSN 0x703b96, GID ::
>>>> remote address: LID 0x0002, QPN 0x000407, PSN 0x618cc8, GID ::
>>>> remote address: LID 0x0002, QPN 0x000408, PSN 0xd62272, GID ::
>>>> remote address: LID 0x0002, QPN 0x000409, PSN 0x5db5d9, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040a, PSN 0xc51978, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040b, PSN 0x05fd7a, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040c, PSN 0xaa4a51, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040d, PSN 0xb7a676, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040e, PSN 0x56bde2, GID ::
>>>> remote address: LID 0x0002, QPN 0x00040f, PSN 0xa662bc, GID ::
>>>> remote address: LID 0x0002, QPN 0x000410, PSN 0xee27b0, GID ::
>>>> remote address: LID 0x0002, QPN 0x000411, PSN 0x89c683, GID ::
>>>> remote address: LID 0x0002, QPN 0x000412, PSN 0xd025b3, GID ::
>>>> remote address: LID 0x0002, QPN 0x000413, PSN 0xcec8e4, GID ::
>>>> remote address: LID 0x0002, QPN 0x000414, PSN 0x37e5d2, GID ::
>>>> remote address: LID 0x0002, QPN 0x000415, PSN 0x29562e, GID ::
>>>> 8192000 bytes in 0.01 seconds = 7423.65 Mbit/sec
>>>> 1000 iters in 0.01 seconds = 8.83 usec/iter
>>>>
>>>> Based on the output, I believe it ran correctly.
>>>>
>>>> On Wed, Dec 1, 2010 at 9:51 AM, Anand Avati <anand.avati at gmail.com> wrote:
>>>>> Can you verify that ibv_srq_pingpong works from the server where this log
>>>>> file is from?
>>>>>
>>>>> Thanks,
>>>>> Avati
>>>>>
>>>>> On Wed, Dec 1, 2010 at 7:44 PM, Jeremy Stout <stout.jeremy at gmail.com> wrote:
>>>>>> Whenever I try to start or mount a GlusterFS 3.1.1 volume that uses
>>>>>> RDMA, I'm seeing the following error messages in the log file on the
>>>>>> server:
>>>>>> [2010-11-30 18:37:53.51270] I [nfs.c:652:init] nfs: NFS service started
>>>>>> [2010-11-30 18:37:53.51362] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>> [2010-11-30 18:37:53.51375] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>> [2010-11-30 18:37:53.59628] E [rdma.c:2066:rdma_create_cq]
>>>>>> rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>>>>>> [2010-11-30 18:37:53.59851] E [rdma.c:3771:rdma_get_device]
>>>>>> rpc-transport/rdma: testdir-client-0: could not create CQ
>>>>>> [2010-11-30 18:37:53.59925] E [rdma.c:3957:rdma_init]
>>>>>> rpc-transport/rdma: could not create rdma device for mthca0
>>>>>> [2010-11-30 18:37:53.60009] E [rdma.c:4789:init] testdir-client-0:
>>>>>> Failed to initialize IB Device
>>>>>> [2010-11-30 18:37:53.60030] E [rpc-transport.c:971:rpc_transport_load]
>>>>>> rpc-transport: 'rdma' initialization failed
>>>>>>
>>>>>> On the client, I see:
>>>>>> [2010-11-30 18:43:49.653469] W [io-stats.c:1644:init] testdir:
>>>>>> dangling volume. check volfile
>>>>>> [2010-11-30 18:43:49.653573] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>> [2010-11-30 18:43:49.653607] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>> [2010-11-30 18:43:49.736275] E [rdma.c:2066:rdma_create_cq]
>>>>>> rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>>>>>> [2010-11-30 18:43:49.736651] E [rdma.c:3771:rdma_get_device]
>>>>>> rpc-transport/rdma: testdir-client-0: could not create CQ
>>>>>> [2010-11-30 18:43:49.736689] E [rdma.c:3957:rdma_init]
>>>>>> rpc-transport/rdma: could not create rdma device for mthca0
>>>>>> [2010-11-30 18:43:49.736805] E [rdma.c:4789:init] testdir-client-0:
>>>>>> Failed to initialize IB Device
>>>>>> [2010-11-30 18:43:49.736841] E
>>>>>> [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma'
>>>>>> initialization failed
>>>>>>
>>>>>> This results in an unsuccessful mount.
>>>>>>
>>>>>> I created the volume using the following commands:
>>>>>> /usr/local/glusterfs/3.1.1/sbin/gluster volume create testdir
>>>>>> transport rdma submit-1:/exports
>>>>>> /usr/local/glusterfs/3.1.1/sbin/gluster volume start testdir
>>>>>>
>>>>>> To mount the directory, I use:
>>>>>> mount -t glusterfs submit-1:/testdir /mnt/glusterfs
>>>>>>
>>>>>> I don't think it is an Infiniband problem, since GlusterFS 3.0.6 and
>>>>>> GlusterFS 3.1.0 worked on the same systems. For GlusterFS 3.1.0, the
>>>>>> commands listed above produced no error messages.
>>>>>>
>>>>>> If anyone can provide help with debugging these error messages, it
>>>>>> would be appreciated.

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users