Re: rdma_cm segfaults on RoCE with ConnectX-4 [WAS: Re: rping segfault with 4.9.28 on CentOS 7.3]

On Wed, May 17, 2017 at 12:14:18PM -0600, Robert LeBlanc wrote:
> Since I have a ConnectX-3 card in this same box, I set it up as
> InfiniBand. I can run all the tests (udaddy, rping, ib_send_bw with -R
> or -z) using the InfiniBand link, but the RoCE ConnectX-4 LX segfaults
> on any rdma_cm communication.
>
> I put the ConnectX-3 into Ethernet mode and ran the tests again and it
> passed all of them while the ConnectX-4 LX cards still failed. We have
> some ConnectX-4 EN 100 Gb cards in other boxes that have the same
> problem.
>
> It really looks like this problem is specific to ConnectX-4 (mlx5
> driver) when running in RoCE. I _don't_ have ConnectX-4 IB cards to
> test. We are also seeing the problem with the Mellanox drivers. I
> can't find http://www.mellanox.com/page/custom_firmware_table to build
> a new OEM firmware for my SuperMicro branded cards to test the latest
> firmware.

Robert,

Please avoid top-posting; it is unreadable.

Regarding your issue, the best way forward is to open a customer
support request and use the established procedures to get proper and
prompt support through the customer channel.

Thanks

> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, May 16, 2017 at 4:00 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> > ib_read_bw can apparently run with or without rdma_cm. By default
> > (without rdma_cm), it works between the nodes. If I specify -R or
> > -z, it fails. It seems that the context is not being set properly
> > when using rdma_cm.
> >
> > "Server"
> > -----------
> >
> > # ib_read_bw
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > ---------------------------------------------------------------------------------------
> >                    RDMA_Read BW Test
> > Dual-port       : OFF          Device         : mlx5_0
> > Number of qps   : 1            Transport type : IB
> > Connection type : RC           Using SRQ      : OFF
> > CQ Moderation   : 100
> > Mtu             : 1024[B]
> > Link type       : Ethernet
> > GID index       : 2
> > Outstand reads  : 16
> > rdma_cm QPs     : OFF
> > Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> > local address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey 0x00175e
> > VAddr 0x007fc73fd6e000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> > remote address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey
> > 0x002797 VAddr 0x007fe5cccc5000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> > ---------------------------------------------------------------------------------------
> > #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> > 65536      1000             2728.79            2728.77            0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8319
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8319]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -R'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > #1  0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2  ctx_init (ctx=ctx@entry=0x110b000,
> > user_param=user_param@entry=0x110ad70) at
> > src/perftest_resources.c:1314
> > #3  0x000000000040585c in rdma_server_connect (ctx=0x110b000,
> > user_param=0x110ad70) at src/perftest_communication.c:1119
> > #4  0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffcd8fec470) at src/perftest_communication.c:1244
> > #5  0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffcd8fec160) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130     }
> > 131
> > 132     int __ibv_query_device(struct ibv_context *context,
> > 133                            struct ibv_device_attr *device_attr)
> > 134     {
> > 135             return context->ops.query_device(context, device_attr);
> > 136     }
> > 137     default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139     int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> > # ib_read_bw -z
> >
> > ************************************
> > * Waiting for client to connect... *
> > ************************************
> > Segmentation fault (core dumped)
> >
> > # gdb ib_read_bw core.8369
> > GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> > Copyright (C) 2013 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > Reading symbols from /usr/bin/ib_read_bw...Reading symbols from
> > /usr/lib/debug/usr/bin/ib_read_bw.debug...done.
> > done.
> > [New LWP 8369]
> > [Thread debugging using libthread_db enabled]
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> > Core was generated by `ib_read_bw -z'.
> > Program terminated with signal 11, Segmentation fault.
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) bt
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > #1  0x0000000000410518 in check_for_contig_pages_support
> > (context=<optimized out>) at src/perftest_resources.c:262
> > #2  ctx_init (ctx=ctx@entry=0x1b3d000,
> > user_param=user_param@entry=0x1b3cd70) at
> > src/perftest_resources.c:1314
> > #3  0x000000000040585c in rdma_server_connect (ctx=0x1b3d000,
> > user_param=0x1b3cd70)
> >    at src/perftest_communication.c:1119
> > #4  0x0000000000405f53 in establish_connection
> > (comm=comm@entry=0x7ffe5f5ee7c0) at src/perftest_communication.c:1244
> > #5  0x0000000000402b37 in main (argc=<optimized out>, argv=<optimized
> > out>) at src/read_bw.c:110
> > (gdb) f 0
> > #0  __ibv_query_device (context=0x0, device_attr=0x7ffe5f5ee4b0) at
> > src/verbs.c:135
> > 135             return context->ops.query_device(context, device_attr);
> > (gdb) list
> > 130     }
> > 131
> > 132     int __ibv_query_device(struct ibv_context *context,
> > 133                            struct ibv_device_attr *device_attr)
> > 134     {
> > 135             return context->ops.query_device(context, device_attr);
> > 136     }
> > 137     default_symver(__ibv_query_device, ibv_query_device);
> > 138
> > 139     int __ibv_query_port(struct ibv_context *context, uint8_t port_num,
> > (gdb) p context
> > $1 = (struct ibv_context *) 0x0
> >
> >
> > "Client"
> > ----------
> > # ib_read_bw 192.168.13.13
> > ---------------------------------------------------------------------------------------
> >                    RDMA_Read BW Test
> > Dual-port       : OFF          Device         : mlx5_0
> > Number of qps   : 1            Transport type : IB
> > Connection type : RC           Using SRQ      : OFF
> > TX depth        : 128
> > CQ Moderation   : 100
> > Mtu             : 1024[B]
> > Link type       : Ethernet
> > GID index       : 2
> > Outstand reads  : 16
> > rdma_cm QPs     : OFF
> > Data ex. method : Ethernet
> > ---------------------------------------------------------------------------------------
> > local address: LID 0000 QPN 0x011a PSN 0xf7747b OUT 0x10 RKey 0x002797
> > VAddr 0x007fe5cccc5000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:14
> > remote address: LID 0000 QPN 0x011a PSN 0xa0e9fd OUT 0x10 RKey
> > 0x00175e VAddr 0x007fc73fd6e000
> > GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:13:13
> > ---------------------------------------------------------------------------------------
> > #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> > Conflicting CPU frequency values detected: 1200.024000 != 2600.000000.
> > CPU Frequency is not max.
> > 65536      1000             2728.79            2728.77            0.043660
> > ---------------------------------------------------------------------------------------
> >
> > # ib_read_bw -R 192.168.13.13
> > Unexpected CM event bl blka 8
> > Unable to perform rdma_client function
> > Unable to init the socket connection
> >
> > # ib_read_bw -z 192.168.13.13
> > Unexpected CM event bl blka 8
> > Unable to perform rdma_client function
> > Unable to init the socket connection
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Tue, May 16, 2017 at 2:50 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> >> I installed OFED 4.0-2.0.0.1 on a fresh snapshot with the stock kernel
> >> (3.10.0-514.16.1.el7.x86_64). I'm getting a segfault on the server
> >> side, but not on the client side. I don't see any debug packages in
> >> the OFED package to load the symbols.
> >>
> >> rping server:
> >>
> >> # gdb rping core.10405
> >> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >> Copyright (C) 2013 Free Software Foundation, Inc.
> >> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >> This is free software: you are free to change and redistribute it.
> >> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >> and "show warranty" for details.
> >> This GDB was configured as "x86_64-redhat-linux-gnu".
> >> For bug reporting instructions, please see:
> >> <http://www.gnu.org/software/gdb/bugs/>...
> >> Reading symbols from /usr/bin/rping...Reading symbols from
> >> /usr/bin/rping...(no debugging symbols found)...done.
> >> (no debugging symbols found)...done.
> >> [New LWP 10405]
> >> [New LWP 10408]
> >> [Thread debugging using libthread_db enabled]
> >> Using host libthread_db library "/lib64/libthread_db.so.1".
> >> Core was generated by `rping -s'.
> >> Program terminated with signal 11, Segmentation fault.
> >> #0  0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> Missing separate debuginfos, use: debuginfo-install
> >> librdmacm-utils-1.1.0mlnx-OFED.4.0.1.6.1.40200.x86_64
> >> (gdb) bt
> >> #0  0x00007f31883d45b4 in ibv_alloc_pd () from /usr/lib64/libibverbs.so.1
> >> #1  0x0000000000402fe6 in rping_setup_qp.isra.7 ()
> >> #2  0x0000000000401d04 in main ()
> >> (gdb) list
> >> No symbol table is loaded.  Use the "file" command.
> >>
> >> rping client:
> >>
> >> # rping -c -a 192.168.13.13
> >> cma event RDMA_CM_EVENT_REJECTED, error 28
> >> wait for CONNECTED state 4
> >> connect error -1
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Tue, May 16, 2017 at 1:23 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> >>> This is using ConnectX-4 LX RoCE cards, using only in-box drivers.
> >>>
> >>> While trying to debug some iSER issues, I'm trying to do rping between
> >>> the two hosts, but I'm getting a segfault. Sagi suggested that there
> >>> may be something wrong with my kernel ABI. I did a make mrproper and
> >>> built the latest 4.9.28 kernel and installed the kernel headers.
> >>>
> >>> make -j 32 && sudo make modules_install && sudo make install && sudo make headers_install INSTALL_HDR_PATH=/usr
> >>>
> >>> After booting into the new kernel, I kept getting the segfaults, so I
> >>> rebuilt the libibverbs, libibumad, and librdmacm packages in case they
> >>> weren't picking up the new kernel headers. Still no luck.
> >>>
> >>> Here is the rping server with the rebuilt packages:
> >>> # gdb rping core.22936
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 22936]
> >>> [New LWP 22939]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -s'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> (gdb) bt
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> #1  0x000055f60331d5f6 in rping_setup_qp (cb=cb@entry=0x55f603d74780,
> >>> cm_id=<optimized out>) at examples/rping.c:519
> >>> #2  0x000055f60331be7e in rping_run_server (cb=0x55f603d74780) at
> >>> examples/rping.c:890
> >>> #3  main (argc=2, argv=0x7ffcd16aae88) at examples/rping.c:1268
> >>> (gdb) f 0
> >>> #0  __ibv_alloc_pd (context=0x0) at src/verbs.c:196
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> (gdb) list
> >>> 191
> >>> 192     struct ibv_pd *__ibv_alloc_pd(struct ibv_context *context)
> >>> 193     {
> >>> 194             struct ibv_pd *pd;
> >>> 195
> >>> 196             pd = context->ops.alloc_pd(context);
> >>> 197             if (pd)
> >>> 198                     pd->context = context;
> >>> 199
> >>> 200             return pd;
> >>> (gdb) p context
> >>> $1 = (struct ibv_context *) 0x0
> >>>
> >>> Here is the rping client that does not have the rebuilt packages:
> >>> # gdb rping core.8253
> >>> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
> >>> Copyright (C) 2013 Free Software Foundation, Inc.
> >>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> >>> This is free software: you are free to change and redistribute it.
> >>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> >>> and "show warranty" for details.
> >>> This GDB was configured as "x86_64-redhat-linux-gnu".
> >>> For bug reporting instructions, please see:
> >>> <http://www.gnu.org/software/gdb/bugs/>...
> >>> Reading symbols from /usr/bin/rping...Reading symbols from
> >>> /usr/lib/debug/usr/bin/rping.debug...done.
> >>> done.
> >>> [New LWP 8253]
> >>> [New LWP 8256]
> >>> [Thread debugging using libthread_db enabled]
> >>> Using host libthread_db library "/lib64/libthread_db.so.1".
> >>> Core was generated by `rping -c -a 192.168.13.13'.
> >>> Program terminated with signal 11, Segmentation fault.
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) bt
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> #1  0x0000560e293cd917 in rping_free_buffers (cb=0x560e295e5780) at
> >>> examples/rping.c:470
> >>> #2  0x0000560e293cbf57 in rping_run_client (cb=<optimized out>) at
> >>> examples/rping.c:1111
> >>> #3  main (argc=<optimized out>, argv=<optimized out>) at examples/rping.c:1270
> >>> (gdb) f 9
> >>> #0  0x0000000000000000 in ?? ()
> >>> (gdb) f 0
> >>> #0  __ibv_dereg_mr (mr=0x560e295e93b0) at src/verbs.c:299
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> (gdb) list
> >>> 294     {
> >>> 295             int ret;
> >>> 296             void *addr      = mr->addr;
> >>> 297             size_t length   = mr->length;
> >>> 298
> >>> 299             ret = mr->context->ops.dereg_mr(mr);
> >>> 300             if (!ret)
> >>> 301                     ibv_dofork_range(addr, length);
> >>> 302
> >>> 303             return ret;
> >>> (gdb) p mr
> >>> $1 = (struct ibv_mr *) 0x560e295e93b0
> >>> (gdb) p *mr
> >>> $2 = {context = 0x7fd423be5090, pd = 0x560e295e9960, addr =
> >>> 0x560e295e57e8, length = 16, handle = 0, lkey = 72829, rkey = 72829}
> >>> (gdb) p *mr->context
> >>> Cannot access memory at address 0x7fd423be5090
> >>>
> >>> Any ideas on what I'm doing wrong?
> >>>
> >>> Thanks,
> >>>
> >>> ----------------
> >>> Robert LeBlanc
> >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
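
For anyone reading the server-side backtraces above: they all fault in the
same way, with a NULL struct ibv_context * being dereferenced inside
libibverbs (__ibv_alloc_pd and __ibv_query_device both do
context->ops.<something> without checking context). The sketch below shows
how a caller could guard against that and fail with an error instead of
SIGSEGV. It is only an illustration and does not address why the context
is NULL on these cards. Every demo_* name is hypothetical: the structs are
stand-ins for the libibverbs/librdmacm types so that the example compiles
on its own, and the idea that the context comes from a "verbs" field on the
cm_id is my assumption about what rping_setup_qp passes in.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real types, reduced to what the quoted
 * verbs.c listing touches. Real code would use <infiniband/verbs.h> and
 * <rdma/rdma_cma.h> instead of these. */
struct demo_pd;

struct demo_context {
	struct demo_pd *(*alloc_pd)(struct demo_context *context);
};

struct demo_pd {
	struct demo_context *context;
};

struct demo_cm_id {
	struct demo_context *verbs; /* assumed NULL when no device is attached */
};

/* Same shape as the quoted __ibv_alloc_pd(): no NULL check on context. */
struct demo_pd *demo_alloc_pd(struct demo_context *context)
{
	struct demo_pd *pd = context->alloc_pd(context);

	if (pd)
		pd->context = context;
	return pd;
}

/* Guarded caller: refuse to go further when the cm_id carries no context. */
struct demo_pd *demo_setup_pd(struct demo_cm_id *cm_id)
{
	if (!cm_id || !cm_id->verbs) {
		fprintf(stderr, "cm_id has no verbs context, not allocating a PD\n");
		errno = ENODEV;
		return NULL;
	}
	return demo_alloc_pd(cm_id->verbs);
}

/* Trivial provider callback, used only to exercise the success path. */
struct demo_pd *demo_provider_alloc_pd(struct demo_context *context)
{
	static struct demo_pd pd;

	(void)context;
	return &pd;
}
```

With a guard like this, a cm_id whose verbs pointer is NULL makes
demo_setup_pd() return NULL with errno set to ENODEV, so the tool can
print a diagnostic and exit cleanly rather than dump core.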

