Re: Set async+rdma in Ceph cluster, then stuck

I tried another setup: I installed Ceph Quincy manually (from apt packages).
It works fine with ms_type = async+posix, but it does not work with the RDMA
setting, and I got the following error logs.

#Case1: ms_type=async+rdma in /etc/ceph/ceph.conf
An error occurs on the client: it cannot establish a queue pair connection.

root@epyc02:/home/nttsic# ceph -s
./src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void
RDMAConnectedSocketImpl::handle_connection()' thread 7f102e7fc640 time
2022-12-21T16:31:13.250949+0900
./src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 215: FAILED ceph_assert(!r)
2022-12-21T16:31:13.246+0900 7f102e7fc640 -1 Infiniband modify_qp_to_rtr
failed to transition to RTR state: (101) Network is unreachable
 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x128) [0x7f1036436369]
 2: /usr/lib/x86_64-linux-gnu/ceph/libceph-common.so.2(+0x257525)
[0x7f1036436525]
 3: (RDMAConnectedSocketImpl::handle_connection()+0xed5) [0x7f10367ae635]
 4: (EventCenter::process_events(unsigned int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x151)
[0x7f1036789ce1]
 5: /usr/lib/x86_64-linux-gnu/ceph/libceph-common.so.2(+0x5b2ff2)
[0x7f1036791ff2]
 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7f10360912b3]
 7: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f10377f8b43]
 8: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f103788aa00]
2022-12-21T16:31:13.246+0900 7f102effd640 -1 Infiniband to_dead failed to
send a beacon: (115) Operation now in progress
Aborted (core dumped)
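
For Case1, my reading (only a guess) is that "(101) Network is unreachable"
from modify_qp_to_rtr points at the GID / RoCE version chosen for the queue
pair rather than at the IP network itself. The checks I intend to try are
roughly the following; the GID below is a placeholder built from the mon
address, not a real value from these hosts, and I am assuming the option name
ms_async_rdma_local_gid behaves as documented (there is also
ms_async_rdma_roce_ver, but I have not verified which value maps to RoCE v2
in Quincy):

--
# list ports, GIDs and RoCE versions of the ConnectX-6 device
ibv_devinfo -v -d mlx5_0
show_gids                      # helper script shipped with MLNX_OFED

# pin Ceph to one specific GID in /etc/ceph/ceph.conf
# (placeholder GID, replace with a real one from show_gids)
[global]
        ms_async_rdma_device_name = mlx5_0
        ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:640b
--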


#Case2: ms_cluster_type = async+rdma in /etc/ceph/ceph.conf, then restart the
Ceph cluster.
When I restart the cluster, the OSD daemon crashes after emitting the logs
below. I checked the code where the error occurs: it cannot allocate the
memory region for the RDMA connection.

2022-12-21T14:18:31.778+0900 7f3be15dc640 -1
./src/msg/async/rdma/Infiniband.cc: In function 'int
Infiniband::MemoryManager::Cluster::fill(uint32_t)' thread 7f3be15dc640
time 2022-12-21T14:18:31.775970+0900
./src/msg/async/rdma/Infiniband.cc: 783: FAILED ceph_assert(m)

 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x128) [0x558d71526799]
 2: /usr/bin/ceph-osd(+0x58c955) [0x558d71526955]
 3: (Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b)
[0x558d720e4ffb]
 4: (Infiniband::init()+0x21f) [0x558d720e942f]
 5: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&,
ServerSocket*)+0x30) [0x558d71ecf800]
 6: /usr/bin/ceph-osd(+0xf19120) [0x558d71eb3120]
 7: (EventCenter::process_events(unsigned int,
std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x71e)
[0x558d71ec36de]
 8: /usr/bin/ceph-osd(+0xf2e9d2) [0x558d71ec89d2]
 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7f3be60ba2b3]
 10: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f3be5d41b43]
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f3be5dd3a00]

2022-12-21T14:18:31.786+0900 7f3be15dc640 -1 *** Caught signal (Aborted) **
 in thread 7f3be15dc640 thread_name:msgr-worker-0

 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy
(stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f3be5cef520]
 2: pthread_kill()

I think the latest Ceph does not work on an RDMA network with ConnectX-6
(mlx5 driver). The RDMA network itself is fine when tested with other RDMA
tools.
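
For Case2, the failed assert is in Infiniband::MemoryManager::Cluster::fill(),
i.e. the OSD cannot register its RDMA buffer pool, so my guess is that this is
the same memlock prerequisite the client already warns about. With a package
(non-cephadm) install the OSDs run under systemd, and /etc/security/limits.conf
does not apply to systemd services, so I plan to raise the limit with a drop-in
roughly like this (unit and target names as in the stock Ubuntu packaging;
adjust if your daemons run differently):

--
# systemctl edit ceph-osd@.service   -> creates an override.conf containing:
[Service]
LimitMEMLOCK=infinity

# apply it
systemctl daemon-reload
systemctl restart ceph.target
--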

Regards,
--
Mitsumasa KONDO


On Tue, Dec 13, 2022 at 19:18 Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx> wrote:

> Hi Serkan,
>
> Thanks for your reply.
>
> -- Server setting --
> OS: Ubuntu 20.04LTS
> NIC: Mellanox ConnectX-6 EN
> Driver: MLNX_OFED_LINUX-5.6-2.0.9.0-ubuntu20.04-x86_64
> --
>
> My ceph.conf is as follows:
> --  ceph.conf  --
> # minimal ceph.conf for 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
> [global]
>         fsid = 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
>         mon_host = [v2:192.168.100.11:3300/0,v1:192.168.100.11:6789/0]
> [v2:192.168.100.12:3300/0,v1:192.168.100.12:6789/0] [v2:
> 192.168.100.13:3300/0,v1:192.168.100.13:6789/0] [v2:
> 192.168.100.14:3300/0,v1:192.168.100.14:6789/0] [v2:
> 192.168.100.16:3300/0,v1:192.168.100.16:6789/0]
>
>         ms_type = async+rdma
>         ms_cluster_type = async+rdma
>         ms_public_type = async+rdma
>         ms_async_rdma_device_name = mlx5_0
>         ms_async_rdma_polling_us = 0
> ---
>
> The Ceph log is below:
> -- /var/log/syslog --
> Dec 12 14:50:10 192 bash[486865]: level=error ts=2022-12-12T05:50:10.114Z
> caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed"
> num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 8
> attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp
> 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]:
> notify retry canceled after 8 attempts: Post \"
> https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
> 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z
> caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
> integration=webhook[0] msg="Notify attempt failed, will retry later"
> attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\":
> dial tcp 127.0.1.1:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z
> caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
> integration=webhook[1] msg="Notify attempt failed, will retry later"
> attempts=1 err="Post \"
> https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
> 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486864]: debug 2022-12-12T05:50:10.761+0000
> 7fa9e4174700  1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes
> cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc:
> 322961408
> Dec 12 14:50:15 192 bash[486864]: debug 2022-12-12T05:50:15.761+0000
> 7fa9e4174700  1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes
> cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc:
> 322961408
> Dec 12 14:50:19 192 bash[486904]: WARNING:ceph-crash:post
> /var/lib/ceph/crash/2022-12-09T09:04:37.157002Z_d160769c-e2cc-4222-8e44-d12fb9c295d8
> as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:19.669+0000
> 7f78e48d8700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
> properly user memlock (ulimit -l) must be big enough to allow large amount
> of registered memory. We recommend setting this parameter to
> infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
> 7f78ceffd700 time
> 2022-12-12T05:50:19.696150+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> 114: FAILED ceph_assert(num)\n ceph version 17.2.5
> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7f78e955f43f]\n 2:
> /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f78e955f605]\n 3:
> (DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
> [0x7f78e990a4b1]\n
> 4: (Infiniband::init()+0xa9) [0x7f78e9908239]\n 5:
> (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
> ConnectedSocket*)+0x41) [0x7f78e991e201]\n 6:
> (AsyncConnection::process()+0x333) [0x7f78e98912e3]\n 7:
> (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
> long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f78e98ed6d4]\n 8:
> /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f78e98f4fa6]\n 9:
> /lib64/libstdc++.so.6(+0xc2ba3) [0x7f78e82deba3]\n 10:
> /lib64/libpthread.so.0(+0x81ca) [0x7f78eca771ca]\n 11: clone()\ntimeout:
> the monitored command dumped core\n")
> Dec 12 14:50:20 192 bash[486865]: level=error ts=2022-12-12T05:50:20.114Z
> caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed"
> num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 7
> attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp
> 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]:
> notify retry canceled after 7 attempts: Post \"
> https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
> 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z
> caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
> integration=webhook[1] msg="Notify attempt failed, will retry later"
> attempts=1 err="Post \"
> https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
> 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z
> caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
> integration=webhook[0] msg="Notify attempt failed, will retry later"
> attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\":
> dial tcp 127.0.1.1:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post
> /var/lib/ceph/crash/2022-12-09T09:04:17.768925Z_097b4858-0e4a-453b-ab26-fc1b3f3bc0f3
> as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.073+0000
> 7f75dc246700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
> properly user memlock (ulimit -l) must be big enough to allow large amount
> of registered memory. We recommend setting this parameter to
> infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
> 7f75ca7fc700 time
> 2022-12-12T05:50:20.102014+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> 114: FAILED ceph_assert(num)\n ceph version 17.2.5
> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7f75e0ecd43f]\n 2:
> /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f75e0ecd605]\n 3:
> (DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
> [0x7f75e12784b1]\n
> 4: (Infiniband::init()+0xa9) [0x7f75e1276239]\n 5:
> (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
> ConnectedSocket*)+0x41) [0x7f75e128c201]\n 6:
> (AsyncConnection::process()+0x333) [0x7f75e11ff2e3]\n 7:
> (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
> long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f75e125b6d4]\n 8:
> /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f75e1262fa6]\n 9:
> /lib64/libstdc++.so.6(+0xc2ba3) [0x7f75dfc4cba3]\n 10:
> /lib64/libpthread.so.0(+0x81ca) [0x7f75e43e51ca]\n 11: clone()\ntimeout:
> the monitored command dumped core\n")
> Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post
> /var/lib/ceph/crash/2022-12-09T09:05:15.560611Z_c647c8b7-e335-4026-aee9-007e4745e5b9
> as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.481+0000
> 7f59f5689700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
> properly user memlock (ulimit -l) must be big enough to allow large amount
> of registered memory. We recommend setting this parameter to
> infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
> 7f59d77fe700 time
> 2022-12-12T05:50:20.508585+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
> 114: FAILED ceph_assert(num)\n ceph version 17.2.5
> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7f59fa31043f]\n 2:
> /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f59fa310605]\n 3:
> (DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
> [0x7f59fa6bb4b1]\n
> 4: (Infiniband::init()+0xa9) [0x7f59fa6b9239]\n 5:
> (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
> ConnectedSocket*)+0x41) [0x7f59fa6cf201]\n 6:
> (AsyncConnection::process()+0x333) [0x7f59fa6422e3]\n 7:
> (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
> long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f59fa69e6d4]\n 8:
> /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f59fa6a5fa6]\n 9:
> /lib64/libstdc++.so.6(+0xc2ba3) [0x7f59f908fba3]\n 10:
> /lib64/libpthread.so.0(+0x81ca) [0x7f59fd8281ca]\n 11: clone()\ntimeout:
> the monitored command dumped core\n")
> --
>
> I cannot find these errors in /var/log/ceph/ceph-osd or
> /var/log/ceph/ceph-volume.
> If you need other logs, please tell me where to look.
>
> Regards,
> --
> Mitsumasa KONDO
>
> On Sun, Dec 11, 2022 at 21:51 Serkan KARCI <karciserkan@xxxxxxxxx> wrote:
>
>> Hello Mitsumasa,
>>
>> Could you please share your ceph.conf and Ceph logs? We managed to make it
>> work on the public network, except for the Cinder service, which we are
>> still working to resolve.
>>
>> Best Regards,
>> Serkan
>>
>>
>>
>> On Fri, Dec 9, 2022 at 12:36 Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to enable the RDMA setting in my Ceph cluster, but after I set
>>> the config it gets stuck...
>>>
>>> # ceph --version
>>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
>>> (stable)
>>> # ceph config set global ms_type async+rdma
>>> # ceph -s
>>> 2022-12-09T17:53:04.954+0900 7f85b55b7700 -1 Infiniband verify_prereq!!!
>>> WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be
>>> big
>>> enough to allow large amount of registered memory. We recommend setting
>>> this parameter to infinity
>>> [[[stuck here]]]
>>>
>>> I confirmed RDMA works on my network with rping, and I set memlock to
>>> unlimited in /etc/security/limits.conf. What should I do?
>>>
>>> Regards,
>>> --
>>> Mitsumasa KONDO
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



