I tried another setup: I installed Ceph Quincy manually (via apt packages). It works fine with ms_type = async+posix, but it does not work with the RDMA settings. Here are the error logs I got.

#Case1: ms_type = async+rdma in /etc/ceph/ceph.conf
An error occurred on the client side: it cannot establish a queue pair (QP) connection with the cluster.

root@epyc02:/home/nttsic# ceph -s
./src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void RDMAConnectedSocketImpl::handle_connection()' thread 7f102e7fc640 time 2022-12-21T16:31:13.250949+0900
./src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 215: FAILED ceph_assert(!r)
2022-12-21T16:31:13.246+0900 7f102e7fc640 -1 Infiniband modify_qp_to_rtr failed to transition to RTR state: (101) Network is unreachable
 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x128) [0x7f1036436369]
 2: /usr/lib/x86_64-linux-gnu/ceph/libceph-common.so.2(+0x257525) [0x7f1036436525]
 3: (RDMAConnectedSocketImpl::handle_connection()+0xed5) [0x7f10367ae635]
 4: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x151) [0x7f1036789ce1]
 5: /usr/lib/x86_64-linux-gnu/ceph/libceph-common.so.2(+0x5b2ff2) [0x7f1036791ff2]
 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7f10360912b3]
 7: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f10377f8b43]
 8: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f103788aa00]
2022-12-21T16:31:13.246+0900 7f102effd640 -1 Infiniband to_dead failed to send a beacon: (115) Operation now in progress
Aborted (core dumped)

#Case2: ms_cluster_type = async+rdma in /etc/ceph/ceph.conf, then restart the Ceph cluster
After the restart, the OSD daemon crashes with the logs below. I checked the code where the error occurs: it cannot allocate a memory region (MR) for the RDMA connection.

2022-12-21T14:18:31.778+0900 7f3be15dc640 -1 ./src/msg/async/rdma/Infiniband.cc: In function 'int Infiniband::MemoryManager::Cluster::fill(uint32_t)' thread 7f3be15dc640 time 2022-12-21T14:18:31.775970+0900
./src/msg/async/rdma/Infiniband.cc: 783: FAILED ceph_assert(m)
 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x128) [0x558d71526799]
 2: /usr/bin/ceph-osd(+0x58c955) [0x558d71526955]
 3: (Infiniband::MemoryManager::Cluster::fill(unsigned int)+0x20b) [0x558d720e4ffb]
 4: (Infiniband::init()+0x21f) [0x558d720e942f]
 5: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x558d71ecf800]
 6: /usr/bin/ceph-osd(+0xf19120) [0x558d71eb3120]
 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x71e) [0x558d71ec36de]
 8: /usr/bin/ceph-osd(+0xf2e9d2) [0x558d71ec89d2]
 9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc2b3) [0x7f3be60ba2b3]
 10: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f3be5d41b43]
 11: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f3be5dd3a00]
2022-12-21T14:18:31.786+0900 7f3be15dc640 -1 *** Caught signal (Aborted) **
 in thread 7f3be15dc640 thread_name:msgr-worker-0
 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
 1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f3be5cef520]
 2: pthread_kill()

I suspect the latest Ceph does not work on an RDMA network with ConnectX-6 (mlx5 driver). My RDMA network itself is fine according to other RDMA tools.
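Two notes on what I suspect is happening (sketches based on my assumptions, not verified fixes):

For Case 1: with RoCE, "modify_qp_to_rtr ... (101) Network is unreachable" usually indicates a GID index / RoCE version mismatch between the two endpoints rather than a broken fabric, which would also explain why rping works. A minimal check, assuming the show_gids script shipped with MLNX_OFED:

# show_gids mlx5_0
# cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0
# cat /sys/class/infiniband/mlx5_0/ports/1/gids/0

show_gids marks RoCE v2 entries as "v2"; the two cat commands read the same information from sysfs for GID index 0. If the fabric expects RoCE v2, the GID and version can, as I understand the option names (please double-check with `ceph config help`), be pinned in ceph.conf. The GID below is a placeholder for 192.168.100.11:

-- ceph.conf (sketch) --
ms_async_rdma_local_gid = 0000:0000:0000:0000:0000:ffff:c0a8:640b
ms_async_rdma_roce_ver = 2
--

For Case 2: if I read Infiniband.cc correctly, the failing assert at line 783 is on the return value of ibv_reg_mr(), i.e. RDMA memory registration fails. That points back at the memlock limit: /etc/security/limits.conf only applies to PAM logins, so an OSD started by systemd never sees it. A sketch for the apt-based install (the drop-in file name is my own choice):

-- sketch: /etc/systemd/system/ceph-osd@.service.d/rdma.conf --
[Service]
LimitMEMLOCK=infinity
--

Then reload and restart (repeat per OSD id), and verify the limit from a running process:

# systemctl daemon-reload
# systemctl restart ceph-osd@0
# grep "locked memory" /proc/$(pgrep -o ceph-osd)/limits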
Regards,
--
Mitsumasa KONDO

On Tue, Dec 13, 2022 at 19:18 Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx> wrote:

> Hi Serkan,
>
> Thanks for your reply.
>
> -- Server setting --
> OS: Ubuntu 20.04 LTS
> NIC: Mellanox ConnectX-6 EN
> Driver: MLNX_OFED_LINUX-5.6-2.0.9.0-ubuntu20.04-x86_64
> --
>
> My ceph.conf is as follows:
> -- ceph.conf --
> # minimal ceph.conf for 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
> [global]
> fsid = 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
> mon_host = [v2:192.168.100.11:3300/0,v1:192.168.100.11:6789/0] [v2:192.168.100.12:3300/0,v1:192.168.100.12:6789/0] [v2:192.168.100.13:3300/0,v1:192.168.100.13:6789/0] [v2:192.168.100.14:3300/0,v1:192.168.100.14:6789/0] [v2:192.168.100.16:3300/0,v1:192.168.100.16:6789/0]
>
> ms_type = async+rdma
> ms_cluster_type = async+rdma
> ms_public_type = async+rdma
> ms_async_rdma_device_name = mlx5_0
> ms_async_rdma_polling_us = 0
> --
>
> The Ceph logs are here:
> -- /var/log/syslog --
> Dec 12 14:50:10 192 bash[486865]: level=error ts=2022-12-12T05:50:10.114Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 8 attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]: notify retry canceled after 8 attempts: Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[1] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:10 192 bash[486864]: debug 2022-12-12T05:50:10.761+0000 7fa9e4174700 1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
> Dec 12 14:50:15 192 bash[486864]: debug 2022-12-12T05:50:15.761+0000 7fa9e4174700 1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
> Dec 12 14:50:19 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:04:37.157002Z_d160769c-e2cc-4222-8e44-d12fb9c295d8 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:19.669+0000 7f78e48d8700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f78ceffd700 time 2022-12-12T05:50:19.696150+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f78e955f43f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f78e955f605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f78e990a4b1]\n 4: (Infiniband::init()+0xa9) [0x7f78e9908239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f78e991e201]\n 6: (AsyncConnection::process()+0x333) [0x7f78e98912e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f78e98ed6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f78e98f4fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f78e82deba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f78eca771ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
> Dec 12 14:50:20 192 bash[486865]: level=error ts=2022-12-12T05:50:20.114Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 7 attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]: notify retry canceled after 7 attempts: Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[1] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused"
> Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:04:17.768925Z_097b4858-0e4a-453b-ab26-fc1b3f3bc0f3 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.073+0000 7f75dc246700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f75ca7fc700 time 2022-12-12T05:50:20.102014+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f75e0ecd43f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f75e0ecd605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f75e12784b1]\n 4: (Infiniband::init()+0xa9) [0x7f75e1276239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f75e128c201]\n 6: (AsyncConnection::process()+0x333) [0x7f75e11ff2e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f75e125b6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f75e1262fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f75dfc4cba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f75e43e51ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
> Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:05:15.560611Z_c647c8b7-e335-4026-aee9-007e4745e5b9 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.481+0000 7f59f5689700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f59d77fe700 time 2022-12-12T05:50:20.508585+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f59fa31043f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f59fa310605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f59fa6bb4b1]\n 4: (Infiniband::init()+0xa9) [0x7f59fa6b9239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f59fa6cf201]\n 6: (AsyncConnection::process()+0x333) [0x7f59fa6422e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f59fa69e6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f59fa6a5fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f59f908fba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f59fd8281ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
> --
>
> I cannot find the errors in /var/log/ceph/ceph-osd or /var/log/ceph/ceph-volume.
> If you want to see other logs, please tell me where to look.
>
> Regards,
> --
> Mitsumasa KONDO
>
> On Sun, Dec 11, 2022 at 21:51 Serkan KARCI <karciserkan@xxxxxxxxx> wrote:
>
>> Hello Mitsumasa,
>>
>> Could you please share your ceph.conf and Ceph logs? We managed to make it work on the public network, except for the Cinder service, which we are still working to resolve.
>>
>> Best Regards,
>> Serkan
>>
>> On Fri, Dec 9, 2022 at 12:36 Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx> wrote:
>>
>>> Hi,
>>>
>>> I am trying to configure RDMA in my Ceph cluster, but when I set the config, it gets stuck...
>>>
>>> # ceph --version
>>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
>>> # ceph config set global ms_type async+rdma
>>> # ceph -s
>>> 2022-12-09T17:53:04.954+0900 7f85b55b7700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity
>>> [[[stuck]]]
>>>
>>> I confirmed that RDMA works on my network with rping, and I set memlock to unlimited in /etc/security/limits.conf. What should I do?
>>>
>>> Regards,
>>> --
>>> Mitsumasa KONDO
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
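P.S. About the DeviceList::DeviceList FAILED ceph_assert(num) crashes in the syslog above: if I read src/msg/async/rdma/Infiniband.h correctly, that assert fires when ibv_get_device_list() finds no RDMA devices at all, so the daemon (which runs in a cephadm/podman container in that deployment) cannot see the NIC. A rough check, assuming the container name osd.0 (`cephadm ls` shows the real names):

On the host:
# ibv_devinfo | grep -e hca_id -e state

Inside the OSD container (via `cephadm enter --name osd.0`):
# ls /dev/infiniband
# ulimit -l

If /dev/infiniband is missing inside the container, no RDMA device is mapped in; and ulimit -l must be unlimited inside the container too, since /etc/security/limits.conf does not apply there. I understand newer cephadm versions can pass extra podman flags via an extra_container_args service spec (e.g. "--ulimit memlock=-1:-1"), but I have not verified this myself.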