On Wed, May 17, 2017 at 12:14:18PM -0600, Robert LeBlanc wrote:
> Since I have a ConnectX-3 card in this same box, I set it up as
> InfiniBand. I can run all the tests (udaddy, rping, ib_send_bw with -R
> or -z) using the InfiniBand link, but the RoCE ConnectX-4 LX cards
> segfault on any rdma_cm communication.
>
> I put the ConnectX-3 into Ethernet mode and ran the tests again and it
> passed all of them while the ConnectX-4 LX cards still failed. We have
> some ConnectX-4 EN 100 Gb cards in other boxes that have the same
> problem.
>
> It really looks like this problem is specific to ConnectX-4 (mlx5
> driver) when running in RoCE. I _don't_ have ConnectX-4 IB cards to
> test. We are also seeing the problem with the Mellanox drivers. I
> can't find http://www.mellanox.com/page/custom_firmware_table to build
> a new OEM firmware for my SuperMicro branded cards to test the latest
> firmware.

Robert,

Please avoid top-posting; it is unreadable.

Regarding your issue, the best way forward is to open a customer support
request and use the established procedures to get proper and prompt
support through the customer channel.

Thanks

> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, May 16, 2017 at 4:00 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> > The ib_read_bw looks like it can use rdma_cm or not. By default, I can
> > get things to work between the nodes. If I specify -R or -z, it fails.
> > It seems that the context is not being set properly when using
> > rdma_cm.
> >
> > "Server"
> > -----------
> >
> > # ib_read_bw
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> >                     RDMA_Read BW Test
> >  Dual-port       : OFF          Device         : mlx5_0
> >  Number of qps   : 1            Transport type : IB
> >  Connection type : RC           Using SRQ      : OFF
> >  CQ Moderation   : 100
> >  Mtu             : 1024[B]
> >  Link type       : Ethernet
> >  GID index       : 2
> >  Outstand reads  : 16
> >  rdma_cm QPs     : OFF
> >  Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> >  local address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey 0x00175e
> > VAddr 0x007fc73fd6e000
> >  GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> >  remote address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey
> > 0x002797 VAddr 0x007fe5cccc5000
> >  GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  65536      1000             2728.79            2728.77            0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8319
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8319]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -R'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > #1  0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2  ctx_init (ctx=ctx@entry=0x110b000,
> > user_param=user_param@entry=0x110ad70) at
> > src/perftest_resources.c:1314
> > #3  0x000000000040585c in rdma_server_connect (ctx=0x110b000,
> > user_param=0x110ad70) at src/perftest_communication.c:1119
> > #4  0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffcd8fec470) at src/perftest_communication.c:1244
> > #5  0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130     }
> > 131
> > 132     int __ibv_query_device(struct ibv_context *context,
> > 133                            struct ibv_device_attr *device_attr)
> > 134     {
> > 135             return context->ops.query_device(context, device_attr);
> > 136     }
> > 137     default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139     int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> > # ib_read_bw -z
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8369
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8369]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -z'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > #1  0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2  ctx_init (ctx=ctx@entry=0x1b3d000,
> > user_param=user_param@entry=0x1b3cd70) at
> > src/perftest_resources.c:1314
> > #3  0x000000000040585c in rdma_server_connect (ctx=0x1b3d000,
> > user_param=0x1b3cd70)
> >     at src/perftest_communication.c:1119
> > #4  0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffe5f5ee7c0) at src/perftest_communication.c:1244
> > #5  0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130     }
> > 131
> > 132     int __ibv_query_device(struct ibv_context *context,
> > 133                            struct ibv_device_attr *device_attr)
> > 134     {
> > 135             return context->ops.query_device(context, device_attr);
> > 136     }
> > 137     default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139     int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> >
> > "Client"
> > ----------
> > # ib_read_bw 192.168.13.13
> > ---------------------------------------------------------------------------------------
> >                     RDMA_Read BW Test
> >  Dual-port       : OFF          Device         : mlx5_0
> >  Number of qps   : 1            Transport type : IB
> >  Connection type : RC           Using SRQ      : OFF
> >  TX depth        : 128
> >  CQ Moderation   : 100
> >  Mtu             : 1024[B]
> >  Link type       : Ethernet
> >  GID index       : 2
> >  Outstand reads  : 16
> >  rdma_cm QPs     : OFF
> >  Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> >  local address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey 0x002797
> > VAddr 0x007fe5cccc5000
> >  GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> >  remote address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey
> > 0x00175e VAddr 0x007fc73fd6e000
> >  GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> > Conflicting CPU frequency values detected: 1200.024000 != 2600.000000.
> > CPU Frequency is not max.
> >  65536      1000             2728.79            2728.77            0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R 192.168.13.13
> > Unexpected CM event bl blka 8
> >  Unable to perform rdma_client function
> >  Unable to init the socket connection
> >
> > # ib_read_bw -z 192.168.13.13
> > Unexpected CM event bl blka 8
> >  Unable to perform rdma_client function
> >  Unable to init the socket connection
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Tue, May 16, 2017 at 2:50 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> >> I installed OFED 4.0-2.0.0.1 on a fresh snapshot with the stock kernel
> >> (3.10.0-514.16.1.el7.x86_64). I'm getting a segfault on the server
> >> side, but not on the client side. I don't see any debug packages in
> >> the OFED package to load the symbols.
> >>
> >> rping server:
> >>
> >> # gdb rping core.10405
> >> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >> Copyright (C) 2013 Free Software Foundation, Inc.
> >> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >> This is free software: you are free to change and redistribute it.
> >> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >> and "show warranty" for details.
> >> This GDB was configured as "x86_64-redhat-linux-gnu".
> >> For bug reporting instructions, please see:
> >> <http://www.gnu.org/software/gdb/bugs/>...
> >> Reading symbols from /usr/bin/rping...Reading symbols from
> >> /usr/bin/rping...(no debugging symbols found)...done.
> >> (no debugging symbols found)...done.
> >> [New LWP 10405]
> >> [New LWP 10408]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> Core was generated by `rping -s'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0  0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> Missing separate debuginfos, use: debuginfo-install
> >> librdmacm-utils-1.1.0mlnx-OFED.4.0.1.6.1.40200.x86_64
> >> (gdb) bt
> >> #0  0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> #1  0x0000000000402fe6 in rping_setup_qp.isra.7 ()
> >> #2  0x0000000000401d04 in main ()
> >> (gdb) list
> >> No symbol table is loaded.  Use the "file" command.
> >>
> >> rping client:
> >>
> >> # rping -c -a 192.168.13.13
> >> cma event RDMA_CM_EVENT_REJECTED, error 28
> >> wait for CONNECTED state 4
> >> connect error -1
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Tue, May 16, 2017 at 1:23 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> >>> This is using ConnectX-4 LX RoCE cards, using only in-box drivers.
> >>>
> >>> While trying to debug some iSER issues, I'm trying to do rping between
> >>> the two hosts, but I'm getting a segfault. Sagi suggested that there
> >>> may be something wrong with my kernel ABI. I did a make mrproper and
> >>> built the latest 4.9.28 kernel and installed the kernel headers.
> >>>
> >>> make -j 32 && sudo make modules_install && sudo make install && sudo
> >>> make headers_install INSTALL_HDR_PATH=/usr
> >>>
> >>> After booting into the new kernel, I kept getting the segfaults, so I
> >>> rebuilt the libibverbs, libibumad, librdmacm packages in case they
> >>> weren't picking up the new kernel headers. Still no luck.
> >>>
> >>> Here is the rping server with the rebuilt packages:
> >>> # gdb rping core.22936
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 22936]
> >>> [New LWP 22939]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -s'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> (gdb) bt
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> #1  0x000055f60331d5f6 in rping_setup_qp (cb=cb@entry=0x55f603d74780,
> >>> cm_id=<optimized out>) at examples/rping.c:519
> >>> #2  0x000055f60331be7e in rping_run_server (cb=0x55f603d74780) at
> >>> examples/rping.c:890
> >>> #3  main (argc=2, argv=0x7ffcd16aae88) at examples/rping.c:1268
> >>> (gdb) f 0
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> (gdb) list
> >>> 191
> >>> 192     struct ibv_pd *__ibv_alloc_pd(struct ibv_context *context)
> >>> 193     {
> >>> 194             struct ibv_pd *pd;
> >>> 195
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> 197             if (pd)
> >>> 198                     pd->context = context;
> >>> 199
> >>> 200             return pd;
> >>> (gdb) p context
> >>> $1 = (struct ibv_context *) 0x0
> >>>
> >>> Here is the rping client that does not have the rebuilt packages:
> >>> # gdb rping core.8253
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 8253]
> >>> [New LWP 8256]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -c -a 192.168.13.13'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) bt
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> #1  0x0000560e293cd917 in rping_free_buffers (cb=0x560e295e5780) at
> >>> examples/rping.c:470
> >>> #2  0x0000560e293cbf57 in rping_run_client (cb=<optimized out>) at
> >>> examples/rping.c:1111
> >>> #3  main (argc=<optimized out>, argv=<optimized out>) at examples/rping.c:1270
> >>> (gdb) f 9
> >>> #0  0x0000000000000000 in ?? ()
> >>> (gdb) f 0
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) list
> >>> 294     {
> >>> 295             int ret;
> >>> 296             void *addr      = mr->addr;
> >>> 297             size_t length   = mr->length;
> >>> 298
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> 300             if (!ret)
> >>> 301                     ibv_dofork_range(addr, length);
> >>> 302
> >>> 303             return ret;
> >>> (gdb) p mr
> >>> $1 = (struct ibv_mr *) 0x560e295e93b0
> >>> (gdb) p *mr
> >>> $2 = {context = 0x7fd423be5090, pd = 0x560e295e9960, addr =
> >>> 0x560e295e57e8, length = 16, handle = 0, lkey = 72829, rkey = 72829}
> >>> (gdb) p *mr->context
> >>> Cannot access memory at address 0x7fd423be5090
> >>>
> >>> Any ideas on what I'm doing wrong?
> >>>
> >>> Thanks,
> >>>
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
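
Note on the backtraces quoted above: every symbolized server-side crash (__ibv_query_device in ib_read_bw, __ibv_alloc_pd in rping) is a call through a NULL struct ibv_context pointer, which is why gdb prints context=0x0 in frame 0. The sketch below only illustrates that failure mode and the sort of defensive check a caller could add. It is not the libibverbs source and not a proposed fix: the fake_* types, checked_alloc_pd() and dummy_alloc_pd() are hypothetical stand-ins invented for this example, loosely modelled on the shape of the function in the gdb listing.

```c
#include <assert.h>   /* only needed for the usage checks in the note below */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the verbs structures; NOT the real definitions. */
struct fake_context;

struct fake_pd {
	struct fake_context *context;
};

struct fake_context_ops {
	struct fake_pd *(*alloc_pd)(struct fake_context *context);
};

struct fake_context {
	struct fake_context_ops ops;
};

/*
 * Same shape as the __ibv_alloc_pd listing, plus a NULL check so that a
 * missing device context is reported instead of being dereferenced.
 */
struct fake_pd *checked_alloc_pd(struct fake_context *context)
{
	struct fake_pd *pd;

	if (context == NULL) {
		fprintf(stderr, "alloc_pd: no device context available\n");
		return NULL;
	}

	pd = context->ops.alloc_pd(context);
	if (pd)
		pd->context = context;

	return pd;
}

/* Dummy provider callback for the demonstration: hands out one static PD. */
static struct fake_pd demo_pd;

struct fake_pd *dummy_alloc_pd(struct fake_context *context)
{
	(void)context;
	return &demo_pd;
}
```

With a context whose ops.alloc_pd points at dummy_alloc_pd, checked_alloc_pd(&ctx) returns a PD whose context field is &ctx, while checked_alloc_pd(NULL) returns NULL and prints a message rather than faulting. A check like this would only turn the segfault into a reported error; it says nothing about why the context was never set on the RoCE ConnectX-4 ports.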