Hi Serkan,

Thanks for your reply.

-- Server setting --
OS: Ubuntu 20.04 LTS
NIC: Mellanox ConnectX-6 EN
Driver: MLNX_OFED_LINUX-5.6-2.0.9.0-ubuntu20.04-x86_64

My ceph.conf is as follows:

-- ceph.conf --
# minimal ceph.conf for 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
[global]
fsid = 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
mon_host = [v2:192.168.100.11:3300/0,v1:192.168.100.11:6789/0] [v2:192.168.100.12:3300/0,v1:192.168.100.12:6789/0] [v2:192.168.100.13:3300/0,v1:192.168.100.13:6789/0] [v2:192.168.100.14:3300/0,v1:192.168.100.14:6789/0] [v2:192.168.100.16:3300/0,v1:192.168.100.16:6789/0]
ms_type = async+rdma
ms_cluster_type = async+rdma
ms_public_type = async+rdma
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_polling_us = 0
---
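(For reference, ms_async_rdma_device_name should match the device name the hosts themselves report. It can be double-checked with the rdma-core tools, assuming they are installed:

# ibv_devices
# ibv_devinfo -d mlx5_0 | grep -e hca_id -e state

rping already succeeds between the nodes, so the host side of RDMA appears functional.)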
Ceph log is here:

-- /var/log/syslog --
Dec 12 14:50:10 192 bash[486865]: level=error ts=2022-12-12T05:50:10.114Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 8 attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]: notify retry canceled after 8 attempts: Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[1] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486864]: debug 2022-12-12T05:50:10.761+0000 7fa9e4174700 1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
Dec 12 14:50:15 192 bash[486864]: debug 2022-12-12T05:50:15.761+0000 7fa9e4174700 1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
Dec 12 14:50:19 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:04:37.157002Z_d160769c-e2cc-4222-8e44-d12fb9c295d8 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:19.669+0000 7f78e48d8700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f78ceffd700 time 2022-12-12T05:50:19.696150+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f78e955f43f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f78e955f605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f78e990a4b1]\n 4: (Infiniband::init()+0xa9) [0x7f78e9908239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f78e991e201]\n 6: (AsyncConnection::process()+0x333) [0x7f78e98912e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f78e98ed6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f78e98f4fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f78e82deba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f78eca771ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
Dec 12 14:50:20 192 bash[486865]: level=error ts=2022-12-12T05:50:20.114Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 7 attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]: notify retry canceled after 7 attempts: Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[1] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp 127.0.1.1:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:04:17.768925Z_097b4858-0e4a-453b-ab26-fc1b3f3bc0f3 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.073+0000 7f75dc246700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f75ca7fc700 time 2022-12-12T05:50:20.102014+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f75e0ecd43f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f75e0ecd605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f75e12784b1]\n 4: (Infiniband::init()+0xa9) [0x7f75e1276239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f75e128c201]\n 6: (AsyncConnection::process()+0x333) [0x7f75e11ff2e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f75e125b6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f75e1262fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f75dfc4cba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f75e43e51ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post /var/lib/ceph/crash/2022-12-09T09:05:15.560611Z_c647c8b7-e335-4026-aee9-007e4745e5b9 as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.481+0000 7f59f5689700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread 7f59d77fe700 time 2022-12-12T05:50:20.508585+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h: 114: FAILED ceph_assert(num)\n ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f59fa31043f]\n 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f59fa310605]\n 3: (DeviceList::DeviceList(ceph::common::CephContext*)+0x131) [0x7f59fa6bb4b1]\n 4: (Infiniband::init()+0xa9) [0x7f59fa6b9239]\n 5: (RDMAWorker::connect(entity_addr_t const&, SocketOptions const&, ConnectedSocket*)+0x41) [0x7f59fa6cf201]\n 6: (AsyncConnection::process()+0x333) [0x7f59fa6422e3]\n 7: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f59fa69e6d4]\n 8: /usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f59fa6a5fa6]\n 9: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f59f908fba3]\n 10: /lib64/libpthread.so.0(+0x81ca) [0x7f59fd8281ca]\n 11: clone()\ntimeout: the monitored command dumped core\n")
--
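If I read the backtraces correctly, FAILED ceph_assert(num) fires in DeviceList::DeviceList() when libibverbs returns an empty device list, i.e. the crash-posting process does not see mlx5_0 at all. Since cephadm runs everything in containers, I suspect the RDMA device nodes are simply not visible inside the container. A rough check I plan to try (note: cephadm shell starts a fresh container from the same image, so it may not reproduce the daemons' exact mounts):

# ls /dev/infiniband                     <- on the host
# cephadm shell -- ls /dev/infiniband    <- inside a Ceph container

If the second command finds nothing, that would explain the assert.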
I cannot find these errors in the /var/log/ceph/ceph-osd or /var/log/ceph/ceph-volume logs. If you want to see other logs, please tell me where to look.
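Also, I now wonder whether my memlock setting in /etc/security/limits.conf ever reaches the daemons, since PAM limits do not apply to systemd services and cephadm runs the daemons under systemd/podman. What I plan to try next is a drop-in on cephadm's systemd template unit (just a sketch; I am not yet sure the limit propagates into the podman containers):

# mkdir -p /etc/systemd/system/ceph-2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9@.service.d
# printf '[Service]\nLimitMEMLOCK=infinity\n' > /etc/systemd/system/ceph-2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9@.service.d/memlock.conf
# systemctl daemon-reload
# systemctl restart ceph-2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9.target

Does that look like the right direction to you?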
Regards,
--
Mitsumasa KONDO

On Sun, Dec 11, 2022 at 21:51, Serkan KARCI <karciserkan@xxxxxxxxx> wrote:

> Hello Mitsumasa,
>
> Could you please share your ceph.conf and Ceph logs? We managed to make it
> work on the public network, except for the Cinder service, which we are
> still working to resolve.
>
> Best Regards,
> Serkan
>
>
> On Fri, Dec 9, 2022 at 12:36, Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx>
> wrote:
>
>> Hi,
>>
>> I am trying to set up RDMA in my Ceph cluster, but when I set the config,
>> it gets stuck...
>>
>> # ceph --version
>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
>> (stable)
>> # ceph config set global ms_type async+rdma
>> # ceph -s
>> 2022-12-09T17:53:04.954+0900 7f85b55b7700 -1 Infiniband verify_prereq!!!
>> WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big
>> enough to allow large amount of registered memory. We recommend setting
>> this parameter to infinity
>> [[[stuck]]]
>>
>> I confirmed the RDMA function on my network with rping, and I set memlock
>> to unlimited in /etc/security/limits.conf. What should I do?
>>
>> Regards,
>> --
>> Mitsumasa KONDO
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx