Hi Raghavendra, Jeremy,

Thanks, I have tried with the patch and also with OFED 1.5.2, and got pretty much what Jeremy had:

[2010-12-10 13:32:59.69007] E [rdma.c:2047:rdma_create_cq] rpc-transport/rdma: max_mr_size = 18446744073709551615, max_cq = 65408, max_cqe = 131071, max_mr = 131056

Aren't these parameters configurable at some driver level? I am a bit new to the IB business, so I don't know... How do you suggest I proceed? Should I try the unaccepted patch?

cheers
Artem.
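For reference, the numbers in that log line are the HCA's device attributes as reported by libibverbs; ibv_devinfo -v prints the same fields. A minimal standalone sketch that queries them directly (an illustration only, assuming the libibverbs development headers are installed; link with -libverbs) might look like:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        int n;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) {
                fprintf(stderr, "no IB devices found\n");
                return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;

        /* ibv_query_device fills in the per-device limits (max_cq,
         * max_cqe, max_mr, ...) that show up in the gluster log. */
        if (ctx && ibv_query_device(ctx, &attr) == 0)
                printf("%s: max_cq = %d, max_cqe = %d, max_mr = %d\n",
                       ibv_get_device_name(devs[0]),
                       attr.max_cq, attr.max_cqe, attr.max_mr);

        if (ctx)
                ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}

Whether these limits can be raised at the driver level depends on the HCA kernel module; they are generally capabilities of the hardware rather than per-process settings, which is why the patch below clamps the requested value instead.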
On Fri, Dec 10, 2010 at 6:22 AM, Raghavendra G <raghavendra at gluster.com> wrote:
> Hi Artem,
>
> you can check the maximum limits using the patch I had sent earlier in the same thread. Also, the patch
> http://patches.gluster.com/patch/5844/ (which is not accepted yet) will check whether the number of cqe being passed to ibv_create_cq is greater than the value allowed by the device, and if so, it will try to create the CQ with the maximum limit allowed by the device.
>
> regards,
> ----- Original Message -----
> From: "Artem Trunov" <datamove at gmail.com>
> To: "Raghavendra G" <raghavendra at gluster.com>
> Cc: "Jeremy Stout" <stout.jeremy at gmail.com>, gluster-users at gluster.org
> Sent: Thursday, December 9, 2010 7:13:40 PM
> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>
> Hi Raghavendra, Jeremy,
>
> This has been a very interesting debugging thread for me, since I have the
> same symptoms, but I am unsure of the origin. Please see the log for my mount
> command at the end of this message.
>
> I have installed 3.1.1. My OFED is 1.5.1 - does it make a serious
> difference compared to the already mentioned 1.5.2?
>
> On hardware limitations - I have a Mellanox InfiniHost III Lx 20Gb/s, and
> its spec sheet says:
>
> "Supports 16 million QPs, EEs & CQs"
>
> Is this enough? How can I query the actual settings for max_cq and max_cqe?
>
> In general, how should I proceed? What are my other debugging options?
> Should I go down Jeremy's path of hacking the gluster code?
>
> cheers
> Artem.
>
> Log:
>
> ---------
> [2010-12-09 15:15:53.847595] W [io-stats.c:1644:init] test-volume: dangling volume. check volfile
> [2010-12-09 15:15:53.847643] W [dict.c:1204:data_to_str] dict: @data=(nil)
> [2010-12-09 15:15:53.847657] W [dict.c:1204:data_to_str] dict: @data=(nil)
> [2010-12-09 15:15:53.858574] E [rdma.c:2066:rdma_create_cq] rpc-transport/rdma: test-volume-client-1: creation of send_cq failed
> [2010-12-09 15:15:53.858805] E [rdma.c:3771:rdma_get_device] rpc-transport/rdma: test-volume-client-1: could not create CQ
> [2010-12-09 15:15:53.858821] E [rdma.c:3957:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
> [2010-12-09 15:15:53.858893] E [rdma.c:4789:init] test-volume-client-1: Failed to initialize IB Device
> [2010-12-09 15:15:53.858909] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
> pending frames:
>
> patchset: v3.1.1
> signal received: 11
> time of crash: 2010-12-09 15:15:53
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.1.1
> /lib64/libc.so.6[0x32aca302d0]
> /lib64/libc.so.6(strcmp+0x0)[0x32aca79140]
> /usr/lib64/glusterfs/3.1.1/rpc-transport/rdma.so[0x2aaaac4fef6c]
> /usr/lib64/glusterfs/3.1.1/rpc-transport/rdma.so(init+0x2f)[0x2aaaac50013f]
> /usr/lib64/libgfrpc.so.0(rpc_transport_load+0x389)[0x3fcca0cac9]
> /usr/lib64/libgfrpc.so.0(rpc_clnt_new+0xfe)[0x3fcca1053e]
> /usr/lib64/glusterfs/3.1.1/xlator/protocol/client.so(client_init_rpc+0xa1)[0x2aaaab194f01]
> /usr/lib64/glusterfs/3.1.1/xlator/protocol/client.so(init+0x129)[0x2aaaab1950d9]
> /usr/lib64/libglusterfs.so.0(xlator_init+0x58)[0x3fcc617398]
> /usr/lib64/libglusterfs.so.0(glusterfs_graph_init+0x31)[0x3fcc640291]
> /usr/lib64/libglusterfs.so.0(glusterfs_graph_activate+0x38)[0x3fcc6403c8]
> /usr/sbin/glusterfs(glusterfs_process_volfp+0xfa)[0x40373a]
> /usr/sbin/glusterfs(mgmt_getspec_cbk+0xc5)[0x406125]
> /usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x3fcca0f542]
> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x3fcca0f73d]
> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x3fcca0a95c]
> /usr/lib64/glusterfs/3.1.1/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad6ef9f]
> /usr/lib64/glusterfs/3.1.1/rpc-transport/socket.so(socket_event_handler+0x170)[0x2aaaaad6f130]
> /usr/lib64/libglusterfs.so.0[0x3fcc637917]
> /usr/sbin/glusterfs(main+0x39b)[0x40470b]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x32aca1d994]
> /usr/sbin/glusterfs[0x402e29]
>
>
>
>
> On Fri, Dec 3, 2010 at 1:53 PM, Raghavendra G <raghavendra at gluster.com> wrote:
>> From the logs it's evident that the reason for the completion queue creation failure is that the number of completion queue elements (in a completion queue) we had requested in ibv_create_cq (1024 * send_count) is greater than the maximum supported by the ib hardware (max_cqe = 131071).
>>
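For context: with the default send_count of 128, that request is 1024 * 128 = 131072 CQEs, one more than the max_cqe = 131071 these logs report. A rough sketch of the clamping behaviour the unaccepted patch is described as implementing (illustrative only -- the names and structure here are assumptions, not the actual patch at http://patches.gluster.com/patch/5844/):

#include <infiniband/verbs.h>

/* Illustrative sketch, not the real gluster patch: cap the requested
 * CQE count at the device maximum before creating the completion queue. */
static struct ibv_cq *
create_cq_clamped(struct ibv_context *ctx, struct ibv_comp_channel *chan,
                  int want_cqe)
{
        struct ibv_device_attr attr;

        if (ibv_query_device(ctx, &attr) == 0 && want_cqe > attr.max_cqe)
                want_cqe = attr.max_cqe;   /* e.g. 131072 -> 131071 here */

        return ibv_create_cq(ctx, want_cqe, NULL, chan, 0);
}

The transport would then call something like create_cq_clamped(ctx, channel, 1024 * send_count) instead of handing the raw value to ibv_create_cq.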
>> ----- Original Message -----
>> From: "Jeremy Stout" <stout.jeremy at gmail.com>
>> To: "Raghavendra G" <raghavendra at gluster.com>
>> Cc: gluster-users at gluster.org
>> Sent: Friday, December 3, 2010 4:20:04 PM
>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>
>> I patched the source code and rebuilt GlusterFS. Here are the full logs:
>>
>> Server:
>> [2010-12-03 07:08:55.945804] I [glusterd.c:275:init] management: Using /etc/glusterd as working directory
>> [2010-12-03 07:08:55.947692] E [rdma.c:2047:rdma_create_cq] rpc-transport/rdma: max_mr_size = 18446744073709551615, max_cq = 65408, max_cqe = 131071, max_mr = 131056
>> [2010-12-03 07:08:55.953226] E [rdma.c:2079:rdma_create_cq] rpc-transport/rdma: rdma.management: creation of send_cq failed
>> [2010-12-03 07:08:55.953509] E [rdma.c:3785:rdma_get_device] rpc-transport/rdma: rdma.management: could not create CQ
>> [2010-12-03 07:08:55.953582] E [rdma.c:3971:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
>> [2010-12-03 07:08:55.953668] E [rdma.c:4803:init] rdma.management: Failed to initialize IB Device
>> [2010-12-03 07:08:55.953691] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
>> [2010-12-03 07:08:55.953780] I [glusterd.c:96:glusterd_uuid_init] glusterd: generated UUID: 4eb47ca7-227c-49c4-97bd-25ac177b2f6a
>> Given volfile:
>> +------------------------------------------------------------------------------+
>>   1: volume management
>>   2:     type mgmt/glusterd
>>   3:     option working-directory /etc/glusterd
>>   4:     option transport-type socket,rdma
>>   5:     option transport.socket.keepalive-time 10
>>   6:     option transport.socket.keepalive-interval 2
>>   7: end-volume
>>   8:
>>
>> +------------------------------------------------------------------------------+
>> [2010-12-03 07:09:10.244790] I [glusterd-handler.c:785:glusterd_handle_create_volume] glusterd: Received create volume req
>> [2010-12-03 07:09:10.247646] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by 4eb47ca7-227c-49c4-97bd-25ac177b2f6a
>> [2010-12-03 07:09:10.247678] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
>> [2010-12-03 07:09:10.247708] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 0 peers
>> [2010-12-03 07:09:10.248038] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 0 peers
>> [2010-12-03 07:09:10.251970] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 0 peers
>> [2010-12-03 07:09:10.252020] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 0 peers
>> [2010-12-03 07:09:10.252036] I [glusterd-op-sm.c:4738:glusterd_op_txn_complete] glusterd: Cleared local lock
>> [2010-12-03 07:09:22.11649] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume testdir
>> [2010-12-03 07:09:22.11724] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by 4eb47ca7-227c-49c4-97bd-25ac177b2f6a
>> [2010-12-03 07:09:22.11734] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
>> [2010-12-03 07:09:22.11761] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 0 peers
>> [2010-12-03 07:09:22.12120] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 0 peers
>> [2010-12-03 07:09:22.184403] I [glusterd-utils.c:971:glusterd_volume_start_glusterfs] : About to start glusterfs for brick pgh-submit-1:/mnt/gluster
>> [2010-12-03 07:09:22.229143] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 0 peers
>> [2010-12-03 07:09:22.229198] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 0 peers
>> [2010-12-03 07:09:22.229218] I [glusterd-op-sm.c:4738:glusterd_op_txn_complete] glusterd: Cleared local lock
>> [2010-12-03 07:09:22.240157] I [glusterd-pmap.c:281:pmap_registry_remove] pmap: removing brick (null) on port 24009
>>
>>
>> Client:
>> [2010-12-03 07:09:00.82784] W [io-stats.c:1644:init] testdir: dangling volume. check volfile
>> [2010-12-03 07:09:00.82824] W [dict.c:1204:data_to_str] dict: @data=(nil)
>> [2010-12-03 07:09:00.82836] W [dict.c:1204:data_to_str] dict: @data=(nil)
>> [2010-12-03 07:09:00.85980] E [rdma.c:2047:rdma_create_cq] rpc-transport/rdma: max_mr_size = 18446744073709551615, max_cq = 65408, max_cqe = 131071, max_mr = 131056
>> [2010-12-03 07:09:00.92883] E [rdma.c:2079:rdma_create_cq] rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>> [2010-12-03 07:09:00.93156] E [rdma.c:3785:rdma_get_device] rpc-transport/rdma: testdir-client-0: could not create CQ
>> [2010-12-03 07:09:00.93224] E [rdma.c:3971:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
>> [2010-12-03 07:09:00.93313] E [rdma.c:4803:init] testdir-client-0: Failed to initialize IB Device
>> [2010-12-03 07:09:00.93332] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
>> Given volfile:
>> +------------------------------------------------------------------------------+
>>   1: volume testdir-client-0
>>   2:     type protocol/client
>>   3:     option remote-host submit-1
>>   4:     option remote-subvolume /mnt/gluster
>>   5:     option transport-type rdma
>>   6: end-volume
>>   7:
>>   8: volume testdir-write-behind
>>   9:     type performance/write-behind
>>  10:     subvolumes testdir-client-0
>>  11: end-volume
>>  12:
>>  13: volume testdir-read-ahead
>>  14:     type performance/read-ahead
>>  15:     subvolumes testdir-write-behind
>>  16: end-volume
>>  17:
>>  18: volume testdir-io-cache
>>  19:     type performance/io-cache
>>  20:     subvolumes testdir-read-ahead
>>  21: end-volume
>>  22:
>>  23: volume testdir-quick-read
>>  24:     type performance/quick-read
>>  25:     subvolumes testdir-io-cache
>>  26: end-volume
>>  27:
>>  28: volume testdir-stat-prefetch
>>  29:     type performance/stat-prefetch
>>  30:     subvolumes testdir-quick-read
>>  31: end-volume
>>  32:
>>  33: volume testdir
>>  34:     type debug/io-stats
>>  35:     subvolumes testdir-stat-prefetch
>>  36: end-volume
>>
>> +------------------------------------------------------------------------------+
>>
>>
>> On Fri, Dec 3, 2010 at 12:38 AM, Raghavendra G <raghavendra at gluster.com> wrote:
>>> Hi Jeremy,
>>>
>>> Can you apply the attached patch, rebuild and start glusterfs? Please make sure to send us the logs of glusterfs.
>>>
>>> regards,
>>> ----- Original Message -----
>>> From: "Jeremy Stout" <stout.jeremy at gmail.com>
>>> To: gluster-users at gluster.org
>>> Sent: Friday, December 3, 2010 6:38:00 AM
>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>
>>> I'm currently using OFED 1.5.2.
>>>
>>> For the sake of testing, I just compiled GlusterFS 3.1.1 from source, without any modifications, on two systems that have a 2.6.33.7 kernel and OFED 1.5.2 built from source.
>>> Here are the results:
>>>
>>> Server:
>>> [2010-12-02 21:17:55.886563] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume testdir
>>> [2010-12-02 21:17:55.886597] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by 7dd23af5-277e-4ea1-a495-2a9d882287ec
>>> [2010-12-02 21:17:55.886607] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
>>> [2010-12-02 21:17:55.886628] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 0 peers
>>> [2010-12-02 21:17:55.887031] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 0 peers
>>> [2010-12-02 21:17:56.60427] I [glusterd-utils.c:971:glusterd_volume_start_glusterfs] : About to start glusterfs for brick submit-1:/mnt/gluster
>>> [2010-12-02 21:17:56.104896] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 0 peers
>>> [2010-12-02 21:17:56.104935] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 0 peers
>>> [2010-12-02 21:17:56.104953] I [glusterd-op-sm.c:4738:glusterd_op_txn_complete] glusterd: Cleared local lock
>>> [2010-12-02 21:17:56.114764] I [glusterd-pmap.c:281:pmap_registry_remove] pmap: removing brick (null) on port 24009
>>>
>>> Client:
>>> [2010-12-02 21:17:25.503395] W [io-stats.c:1644:init] testdir: dangling volume. check volfile
>>> [2010-12-02 21:17:25.503434] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>> [2010-12-02 21:17:25.503447] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>> [2010-12-02 21:17:25.543409] E [rdma.c:2066:rdma_create_cq] rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>>> [2010-12-02 21:17:25.543660] E [rdma.c:3771:rdma_get_device] rpc-transport/rdma: testdir-client-0: could not create CQ
>>> [2010-12-02 21:17:25.543725] E [rdma.c:3957:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
>>> [2010-12-02 21:17:25.543812] E [rdma.c:4789:init] testdir-client-0: Failed to initialize IB Device
>>> [2010-12-02 21:17:25.543830] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
>>>
>>> Thank you for the help so far.
>>>
>>> On Thu, Dec 2, 2010 at 8:02 PM, Craig Carl <craig at gluster.com> wrote:
>>>> Jeremy -
>>>>    What version of OFED are you running? Would you mind installing version 1.5.2 from source? We have seen this resolve several issues of this type.
>>>> http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Craig
>>>>
>>>> --
>>>> Craig Carl
>>>> Senior Systems Engineer
>>>> Gluster
>>>>
>>>>
>>>> On 12/02/2010 10:05 AM, Jeremy Stout wrote:
>>>>>
>>>>> As another follow-up, I tested several compilations today with different values for the send/receive count. I found the maximum value I could use for both variables was 127. With a value of 127, GlusterFS did not produce any errors. However, when I changed the value back to 128, the RDMA errors appeared again.
>>>>>
>>>>> I also tried setting the soft/hard "memlock" limits to unlimited in the limits.conf file, but still ran into RDMA errors on the client side when the count variables were set to 128.
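A note on that memlock attempt: the ulimit -l output quoted below shows a 64 kB locked-memory limit, which is a common cause of verbs resource-creation failures. The limits.conf change described above would look roughly like this (a sketch; whether it reaches a daemon depends on the PAM configuration, and it only takes effect on a fresh login or service restart):

# /etc/security/limits.conf -- allow RDMA buffers to be locked in RAM
* soft memlock unlimited
* hard memlock unlimited

After logging in again, ulimit -l should report "unlimited".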
Since both systems generated the same >>>>>> error messages, I'll include the output for both. >>>>>> >>>>>> System #1: >>>>>> fs-1:~ # cat /proc/meminfo >>>>>> MemTotal: ? ? ? 16468756 kB >>>>>> MemFree: ? ? ? ?16126680 kB >>>>>> Buffers: ? ? ? ? ? 15680 kB >>>>>> Cached: ? ? ? ? ? 155860 kB >>>>>> SwapCached: ? ? ? ? ? ?0 kB >>>>>> Active: ? ? ? ? ? ?65228 kB >>>>>> Inactive: ? ? ? ? 123100 kB >>>>>> Active(anon): ? ? ?18632 kB >>>>>> Inactive(anon): ? ? ? 48 kB >>>>>> Active(file): ? ? ?46596 kB >>>>>> Inactive(file): ? 123052 kB >>>>>> Unevictable: ? ? ? ?1988 kB >>>>>> Mlocked: ? ? ? ? ? ?1988 kB >>>>>> SwapTotal: ? ? ? ? ? ? 0 kB >>>>>> SwapFree: ? ? ? ? ? ? ?0 kB >>>>>> Dirty: ? ? ? ? ? ? 30072 kB >>>>>> Writeback: ? ? ? ? ? ? 4 kB >>>>>> AnonPages: ? ? ? ? 18780 kB >>>>>> Mapped: ? ? ? ? ? ?12136 kB >>>>>> Shmem: ? ? ? ? ? ? ? 220 kB >>>>>> Slab: ? ? ? ? ? ? ?39592 kB >>>>>> SReclaimable: ? ? ?13108 kB >>>>>> SUnreclaim: ? ? ? ?26484 kB >>>>>> KernelStack: ? ? ? ?2360 kB >>>>>> PageTables: ? ? ? ? 2036 kB >>>>>> NFS_Unstable: ? ? ? ? ?0 kB >>>>>> Bounce: ? ? ? ? ? ? ? ?0 kB >>>>>> WritebackTmp: ? ? ? ? ?0 kB >>>>>> CommitLimit: ? ? 8234376 kB >>>>>> Committed_AS: ? ? 107304 kB >>>>>> VmallocTotal: ? 34359738367 kB >>>>>> VmallocUsed: ? ? ?314316 kB >>>>>> VmallocChunk: ? 34349860776 kB >>>>>> HardwareCorrupted: ? ? 0 kB >>>>>> HugePages_Total: ? ? ? 0 >>>>>> HugePages_Free: ? ? ? ?0 >>>>>> HugePages_Rsvd: ? ? ? ?0 >>>>>> HugePages_Surp: ? ? ? ?0 >>>>>> Hugepagesize: ? ? ? 2048 kB >>>>>> DirectMap4k: ? ? ? ?9856 kB >>>>>> DirectMap2M: ? ? 3135488 kB >>>>>> DirectMap1G: ? ?13631488 kB >>>>>> >>>>>> fs-1:~ # uname -a >>>>>> Linux fs-1 2.6.32.25-November2010 #2 SMP PREEMPT Mon Nov 1 15:19:55 >>>>>> EDT 2010 x86_64 x86_64 x86_64 GNU/Linux >>>>>> >>>>>> fs-1:~ # ulimit -l >>>>>> 64 >>>>>> >>>>>> System #2: >>>>>> submit-1:~ # cat /proc/meminfo >>>>>> MemTotal: ? ? ? 16470424 kB >>>>>> MemFree: ? ? ? ?16197292 kB >>>>>> Buffers: ? ? ? ? ? 11788 kB >>>>>> Cached: ? ? ? ? ? ?85492 kB >>>>>> SwapCached: ? ? ? ? ? ?0 kB >>>>>> Active: ? ? ? ? ? ?39120 kB >>>>>> Inactive: ? ? ? ? ?76548 kB >>>>>> Active(anon): ? ? ?18532 kB >>>>>> Inactive(anon): ? ? ? 48 kB >>>>>> Active(file): ? ? ?20588 kB >>>>>> Inactive(file): ? ?76500 kB >>>>>> Unevictable: ? ? ? ? ? 0 kB >>>>>> Mlocked: ? ? ? ? ? ? ? 0 kB >>>>>> SwapTotal: ? ? ?67100656 kB >>>>>> SwapFree: ? ? ? 67100656 kB >>>>>> Dirty: ? ? ? ? ? ? ? ?24 kB >>>>>> Writeback: ? ? ? ? ? ? 0 kB >>>>>> AnonPages: ? ? ? ? 18408 kB >>>>>> Mapped: ? ? ? ? ? ?11644 kB >>>>>> Shmem: ? ? ? ? ? ? ? 184 kB >>>>>> Slab: ? ? ? ? ? ? ?34000 kB >>>>>> SReclaimable: ? ? ? 8512 kB >>>>>> SUnreclaim: ? ? ? ?25488 kB >>>>>> KernelStack: ? ? ? ?2160 kB >>>>>> PageTables: ? ? ? ? 1952 kB >>>>>> NFS_Unstable: ? ? ? ? ?0 kB >>>>>> Bounce: ? ? ? ? ? ? ? ?0 kB >>>>>> WritebackTmp: ? ? ? ? ?0 kB >>>>>> CommitLimit: ? ?75335868 kB >>>>>> Committed_AS: ? ? 105620 kB >>>>>> VmallocTotal: ? 34359738367 kB >>>>>> VmallocUsed: ? ? ? 76416 kB >>>>>> VmallocChunk: ? 34359652640 kB >>>>>> HardwareCorrupted: ? ? 0 kB >>>>>> HugePages_Total: ? ? ? 0 >>>>>> HugePages_Free: ? ? ? ?0 >>>>>> HugePages_Rsvd: ? ? ? ?0 >>>>>> HugePages_Surp: ? ? ? ?0 >>>>>> Hugepagesize: ? ? ? 2048 kB >>>>>> DirectMap4k: ? ? ? ?7488 kB >>>>>> DirectMap2M: ? 
>>>>>>
>>>>>> submit-1:~ # uname -a
>>>>>> Linux submit-1 2.6.33.7-November2010 #1 SMP PREEMPT Mon Nov 8 13:49:00 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>> submit-1:~ # ulimit -l
>>>>>> 64
>>>>>>
>>>>>> I retrieved the memory information on each machine after starting the glusterd process.
>>>>>>
>>>>>> On Thu, Dec 2, 2010 at 3:51 AM, Raghavendra G <raghavendra at gluster.com> wrote:
>>>>>>>
>>>>>>> Hi Jeremy,
>>>>>>>
>>>>>>> can you also get the output of,
>>>>>>>
>>>>>>> #uname -a
>>>>>>>
>>>>>>> #ulimit -l
>>>>>>>
>>>>>>> regards,
>>>>>>> ----- Original Message -----
>>>>>>> From: "Raghavendra G" <raghavendra at gluster.com>
>>>>>>> To: "Jeremy Stout" <stout.jeremy at gmail.com>
>>>>>>> Cc: gluster-users at gluster.org
>>>>>>> Sent: Thursday, December 2, 2010 10:20:04 AM
>>>>>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>>>>>
>>>>>>> Hi Jeremy,
>>>>>>>
>>>>>>> In order to diagnose why completion queue creation is failing (as indicated by the logs), we want to know how much free memory was available on your system when glusterfs was started.
>>>>>>>
>>>>>>> regards,
>>>>>>> ----- Original Message -----
>>>>>>> From: "Raghavendra G" <raghavendra at gluster.com>
>>>>>>> To: "Jeremy Stout" <stout.jeremy at gmail.com>
>>>>>>> Cc: gluster-users at gluster.org
>>>>>>> Sent: Thursday, December 2, 2010 10:11:18 AM
>>>>>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>>>>>
>>>>>>> Hi Jeremy,
>>>>>>>
>>>>>>> Yes, there might be some performance decrease. But it should not affect the working of rdma.
>>>>>>>
>>>>>>> regards,
>>>>>>> ----- Original Message -----
>>>>>>> From: "Jeremy Stout" <stout.jeremy at gmail.com>
>>>>>>> To: gluster-users at gluster.org
>>>>>>> Sent: Thursday, December 2, 2010 8:30:20 AM
>>>>>>> Subject: Re: RDMA Problems with GlusterFS 3.1.1
>>>>>>>
>>>>>>> As an update to my situation, I think I have GlusterFS 3.1.1 working now. I was able to create and mount RDMA volumes without any errors.
>>>>>>>
>>>>>>> To fix the problem, I had to make the following changes on lines 3562 and 3563 in rdma.c:
>>>>>>> options->send_count = 32;
>>>>>>> options->recv_count = 32;
>>>>>>>
>>>>>>> The values were set to 128.
>>>>>>>
>>>>>>> I'll run some tests tomorrow to verify that it is working correctly. Assuming it does, what would be the expected side-effect of changing the values from 128 to 32? Will there be a decrease in performance?
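Worth noting: these values line up with the device limit in the logs above. Assuming the transport requests 1024 CQEs per unit of send_count, as described earlier in the thread, 1024 * 32 = 32768 and 1024 * 127 = 130048 both fit under max_cqe = 131071, while 1024 * 128 = 131072 is one over it -- which would explain exactly why 127 worked and 128 failed.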
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 1, 2010 at 10:07 AM, Jeremy Stout <stout.jeremy at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Here are the results of the test:
>>>>>>>> submit-1:/usr/local/glusterfs/3.1.1/var/log/glusterfs # ibv_srq_pingpong
>>>>>>>>   local address:  LID 0x0002, QPN 0x000406, PSN 0x703b96, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000407, PSN 0x618cc8, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000408, PSN 0xd62272, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000409, PSN 0x5db5d9, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040a, PSN 0xc51978, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040b, PSN 0x05fd7a, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040c, PSN 0xaa4a51, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040d, PSN 0xb7a676, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040e, PSN 0x56bde2, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x00040f, PSN 0xa662bc, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000410, PSN 0xee27b0, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000411, PSN 0x89c683, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000412, PSN 0xd025b3, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000413, PSN 0xcec8e4, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000414, PSN 0x37e5d2, GID ::
>>>>>>>>   local address:  LID 0x0002, QPN 0x000415, PSN 0x29562e, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000406, PSN 0x3b644e, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000407, PSN 0x173320, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000408, PSN 0xc105ea, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000409, PSN 0x5e5ff1, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040a, PSN 0xff15b0, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040b, PSN 0xf0b152, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040c, PSN 0x4ced49, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040d, PSN 0x01da0e, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040e, PSN 0x69459a, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x00040f, PSN 0x197c14, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000410, PSN 0xd50228, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000411, PSN 0xbc9b9b, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000412, PSN 0x0870eb, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000413, PSN 0xfb1fbc, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000414, PSN 0x3eefca, GID ::
>>>>>>>>   remote address: LID 0x000b, QPN 0x000415, PSN 0xbd64c6, GID ::
>>>>>>>> 8192000 bytes in 0.01 seconds = 5917.47 Mbit/sec
>>>>>>>> 1000 iters in 0.01 seconds = 11.07 usec/iter
>>>>>>>>
>>>>>>>> fs-1:/usr/local/glusterfs/3.1.1/var/log/glusterfs # ibv_srq_pingpong submit-1
>>>>>>>>   local address:  LID 0x000b, QPN 0x000406, PSN 0x3b644e, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000407, PSN 0x173320, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000408, PSN 0xc105ea, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000409, PSN 0x5e5ff1, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040a, PSN 0xff15b0, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040b, PSN 0xf0b152, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040c, PSN 0x4ced49, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040d, PSN 0x01da0e, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040e, PSN 0x69459a, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x00040f, PSN 0x197c14, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000410, PSN 0xd50228, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000411, PSN 0xbc9b9b, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000412, PSN 0x0870eb, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000413, PSN 0xfb1fbc, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000414, PSN 0x3eefca, GID ::
>>>>>>>>   local address:  LID 0x000b, QPN 0x000415, PSN 0xbd64c6, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000406, PSN 0x703b96, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000407, PSN 0x618cc8, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000408, PSN 0xd62272, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000409, PSN 0x5db5d9, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040a, PSN 0xc51978, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040b, PSN 0x05fd7a, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040c, PSN 0xaa4a51, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040d, PSN 0xb7a676, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040e, PSN 0x56bde2, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x00040f, PSN 0xa662bc, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000410, PSN 0xee27b0, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000411, PSN 0x89c683, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000412, PSN 0xd025b3, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000413, PSN 0xcec8e4, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000414, PSN 0x37e5d2, GID ::
>>>>>>>>   remote address: LID 0x0002, QPN 0x000415, PSN 0x29562e, GID ::
>>>>>>>> 8192000 bytes in 0.01 seconds = 7423.65 Mbit/sec
>>>>>>>> 1000 iters in 0.01 seconds = 8.83 usec/iter
>>>>>>>>
>>>>>>>> Based on the output, I believe it ran correctly.
>>>>>>>>
>>>>>>>> On Wed, Dec 1, 2010 at 9:51 AM, Anand Avati <anand.avati at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Can you verify that ibv_srq_pingpong works from the server where this log file is from?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Avati
>>>>>>>>>
>>>>>>>>> On Wed, Dec 1, 2010 at 7:44 PM, Jeremy Stout <stout.jeremy at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Whenever I try to start or mount a GlusterFS 3.1.1 volume that uses RDMA, I'm seeing the following error messages in the log file on the server:
>>>>>>>>>> [2010-11-30 18:37:53.51270] I [nfs.c:652:init] nfs: NFS service started
>>>>>>>>>> [2010-11-30 18:37:53.51362] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>>>>>> [2010-11-30 18:37:53.51375] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>>>>>> [2010-11-30 18:37:53.59628] E [rdma.c:2066:rdma_create_cq] rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>>>>>>>>>> [2010-11-30 18:37:53.59851] E [rdma.c:3771:rdma_get_device] rpc-transport/rdma: testdir-client-0: could not create CQ
>>>>>>>>>> [2010-11-30 18:37:53.59925] E [rdma.c:3957:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
>>>>>>>>>> [2010-11-30 18:37:53.60009] E [rdma.c:4789:init] testdir-client-0: Failed to initialize IB Device
>>>>>>>>>> [2010-11-30 18:37:53.60030] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
>>>>>>>>>>
>>>>>>>>>> On the client, I see:
>>>>>>>>>> [2010-11-30 18:43:49.653469] W [io-stats.c:1644:init] testdir: dangling volume. check volfile
>>>>>>>>>> [2010-11-30 18:43:49.653573] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>>>>>> [2010-11-30 18:43:49.653607] W [dict.c:1204:data_to_str] dict: @data=(nil)
>>>>>>>>>> [2010-11-30 18:43:49.736275] E [rdma.c:2066:rdma_create_cq] rpc-transport/rdma: testdir-client-0: creation of send_cq failed
>>>>>>>>>> [2010-11-30 18:43:49.736651] E [rdma.c:3771:rdma_get_device] rpc-transport/rdma: testdir-client-0: could not create CQ
>>>>>>>>>> [2010-11-30 18:43:49.736689] E [rdma.c:3957:rdma_init] rpc-transport/rdma: could not create rdma device for mthca0
>>>>>>>>>> [2010-11-30 18:43:49.736805] E [rdma.c:4789:init] testdir-client-0: Failed to initialize IB Device
>>>>>>>>>> [2010-11-30 18:43:49.736841] E [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma' initialization failed
>>>>>>>>>>
>>>>>>>>>> This results in an unsuccessful mount.
>>>>>>>>>>
>>>>>>>>>> I created the mount using the following commands:
>>>>>>>>>> /usr/local/glusterfs/3.1.1/sbin/gluster volume create testdir transport rdma submit-1:/exports
>>>>>>>>>> /usr/local/glusterfs/3.1.1/sbin/gluster volume start testdir
>>>>>>>>>>
>>>>>>>>>> To mount the directory, I use:
>>>>>>>>>> mount -t glusterfs submit-1:/testdir /mnt/glusterfs
>>>>>>>>>>
>>>>>>>>>> I don't think it is an Infiniband problem since GlusterFS 3.0.6 and GlusterFS 3.1.0 worked on the same systems. For GlusterFS 3.1.0, the commands listed above produced no error messages.
>>>>>>>>>>
>>>>>>>>>> If anyone can provide help with debugging these error messages, it would be appreciated.