Re: Set async+rdma in Ceph cluster, then stuck

Hi Serkan,

Thanks for your reply.

-- Server setting --
OS: Ubuntu 20.04 LTS
NIC: Mellanox ConnectX-6 EN
Driver: MLNX_OFED_LINUX-5.6-2.0.9.0-ubuntu20.04-x86_64
--

My ceph.conf is as follows:
--  ceph.conf  --
# minimal ceph.conf for 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
[global]
        fsid = 2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9
        mon_host = [v2:192.168.100.11:3300/0,v1:192.168.100.11:6789/0] [v2:192.168.100.12:3300/0,v1:192.168.100.12:6789/0] [v2:192.168.100.13:3300/0,v1:192.168.100.13:6789/0] [v2:192.168.100.14:3300/0,v1:192.168.100.14:6789/0] [v2:192.168.100.16:3300/0,v1:192.168.100.16:6789/0]

        ms_type = async+rdma
        ms_cluster_type = async+rdma
        ms_public_type = async+rdma
        ms_async_rdma_device_name = mlx5_0
        ms_async_rdma_polling_us = 0
--
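
As a side note, one quick way to confirm that mlx5_0 is the RDMA device the
daemons should bind to (a sketch, assuming the ibv_* utilities from the
MLNX_OFED / rdma-core install are present):

--  device check (illustrative)  --
# list the RDMA devices visible to libibverbs
ibv_devices
# show port state and GIDs for the device named in ceph.conf
ibv_devinfo -d mlx5_0
--

If no device shows up in the environment a daemon actually runs in (e.g.
inside a cephadm container), the FAILED ceph_assert(num) in DeviceList seen
in the log below would be the expected outcome.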

The Ceph log is here:
-- /var/log/syslog --
Dec 12 14:50:10 192 bash[486865]: level=error ts=2022-12-12T05:50:10.114Z
caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed"
num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 8
attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp
127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]:
notify retry canceled after 8 attempts: Post \"
https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z
caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
integration=webhook[0] msg="Notify attempt failed, will retry later"
attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial
tcp 127.0.1.1:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486865]: level=warn ts=2022-12-12T05:50:10.115Z
caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
integration=webhook[1] msg="Notify attempt failed, will retry later"
attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\":
dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:10 192 bash[486864]: debug 2022-12-12T05:50:10.761+0000
7fa9e4174700  1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes
cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc:
322961408
Dec 12 14:50:15 192 bash[486864]: debug 2022-12-12T05:50:15.761+0000
7fa9e4174700  1 mon.epyc01@0(leader).osd e322 _set_new_cache_sizes
cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc:
322961408
Dec 12 14:50:19 192 bash[486904]: WARNING:ceph-crash:post
/var/lib/ceph/crash/2022-12-09T09:04:37.157002Z_d160769c-e2cc-4222-8e44-d12fb9c295d8
as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:19.669+0000
7f78e48d8700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
properly user memlock (ulimit -l) must be big enough to allow large amount
of registered memory. We recommend setting this parameter to
infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
7f78ceffd700 time
2022-12-12T05:50:19.696150+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
114: FAILED ceph_assert(num)\n ceph version 17.2.5
(98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7f78e955f43f]\n 2:
/usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f78e955f605]\n 3:
(DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
[0x7f78e990a4b1]\n
4: (Infiniband::init()+0xa9) [0x7f78e9908239]\n 5:
(RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
ConnectedSocket*)+0x41) [0x7f78e991e201]\n 6:
(AsyncConnection::process()+0x333) [0x7f78e98912e3]\n 7:
(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f78e98ed6d4]\n 8:
/usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f78e98f4fa6]\n 9:
/lib64/libstdc++.so.6(+0xc2ba3) [0x7f78e82deba3]\n 10:
/lib64/libpthread.so.0(+0x81ca) [0x7f78eca771ca]\n 11: clone()\ntimeout:
the monitored command dumped core\n")
Dec 12 14:50:20 192 bash[486865]: level=error ts=2022-12-12T05:50:20.114Z
caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed"
num_alerts=2 err="ceph-dashboard/webhook[0]: notify retry canceled after 7
attempts: Post \"https://epyc01:8443/api/prometheus_receiver\": dial tcp
127.0.1.1:8443: connect: connection refused; ceph-dashboard/webhook[1]:
notify retry canceled after 7 attempts: Post \"
https://192.168.100.13:8443/api/prometheus_receiver\": dial tcp
192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z
caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
integration=webhook[1] msg="Notify attempt failed, will retry later"
attempts=1 err="Post \"https://192.168.100.13:8443/api/prometheus_receiver\":
dial tcp 192.168.100.13:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486865]: level=warn ts=2022-12-12T05:50:20.115Z
caller=notify.go:724 component=dispatcher receiver=ceph-dashboard
integration=webhook[0] msg="Notify attempt failed, will retry later"
attempts=1 err="Post \"https://epyc01:8443/api/prometheus_receiver\": dial
tcp 127.0.1.1:8443: connect: connection refused"
Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post
/var/lib/ceph/crash/2022-12-09T09:04:17.768925Z_097b4858-0e4a-453b-ab26-fc1b3f3bc0f3
as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.073+0000
7f75dc246700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
properly user memlock (ulimit -l) must be big enough to allow large amount
of registered memory. We recommend setting this parameter to
infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
7f75ca7fc700 time
2022-12-12T05:50:20.102014+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
114: FAILED ceph_assert(num)\n ceph version 17.2.5
(98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7f75e0ecd43f]\n 2:
/usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f75e0ecd605]\n 3:
(DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
[0x7f75e12784b1]\n
4: (Infiniband::init()+0xa9) [0x7f75e1276239]\n 5:
(RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
ConnectedSocket*)+0x41) [0x7f75e128c201]\n 6:
(AsyncConnection::process()+0x333) [0x7f75e11ff2e3]\n 7:
(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f75e125b6d4]\n 8:
/usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f75e1262fa6]\n 9:
/lib64/libstdc++.so.6(+0xc2ba3) [0x7f75dfc4cba3]\n 10:
/lib64/libpthread.so.0(+0x81ca) [0x7f75e43e51ca]\n 11: clone()\ntimeout:
the monitored command dumped core\n")
Dec 12 14:50:20 192 bash[486904]: WARNING:ceph-crash:post
/var/lib/ceph/crash/2022-12-09T09:05:15.560611Z_c647c8b7-e335-4026-aee9-007e4745e5b9
as client.crash.epyc01 failed: (None, b"2022-12-12T05:50:20.481+0000
7f59f5689700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work
properly user memlock (ulimit -l) must be big enough to allow large amount
of registered memory. We recommend setting this parameter to
infinity\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
In function 'DeviceList::DeviceList(ceph::common::CephContext*)' thread
7f59d77fe700 time
2022-12-12T05:50:20.508585+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/msg/async/rdma/Infiniband.h:
114: FAILED ceph_assert(num)\n ceph version 17.2.5
(98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)\n 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7f59fa31043f]\n 2:
/usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7f59fa310605]\n 3:
(DeviceList::DeviceList(ceph::common::CephContext*)+0x131)
[0x7f59fa6bb4b1]\n
4: (Infiniband::init()+0xa9) [0x7f59fa6b9239]\n 5:
(RDMAWorker::connect(entity_addr_t const&, SocketOptions const&,
ConnectedSocket*)+0x41) [0x7f59fa6cf201]\n 6:
(AsyncConnection::process()+0x333) [0x7f59fa6422e3]\n 7:
(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned
long, std::ratio<1l, 1000000000l> >*)+0xa74) [0x7f59fa69e6d4]\n 8:
/usr/lib64/ceph/libceph-common.so.2(+0x5fefa6) [0x7f59fa6a5fa6]\n 9:
/lib64/libstdc++.so.6(+0xc2ba3) [0x7f59f908fba3]\n 10:
/lib64/libpthread.so.0(+0x81ca) [0x7f59fd8281ca]\n 11: clone()\ntimeout:
the monitored command dumped core\n")
--
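
The verify_prereq warning above suggests the daemons still see a small
memlock limit. Since cephadm runs the daemons under systemd rather than a
login shell, my /etc/security/limits.conf setting may not apply to them, so
I am also considering a unit-level override (a sketch; the unit name is
assumed from the fsid above):

--  systemd override (illustrative)  --
# /etc/systemd/system/ceph-2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9@.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity

# afterwards:
#   systemctl daemon-reload
#   systemctl restart ceph-2f383ac8-76cb-11ed-bfbc-6dd8bf17bdf9.target
--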

I could not find these errors in /var/log/ceph/ceph-osd or
/var/log/ceph/ceph-volume.
If you need other logs, please tell me where to look.

Regards,
--
Mitsumasa KONDO

On Sun, Dec 11, 2022 at 21:51, Serkan KARCI <karciserkan@xxxxxxxxx> wrote:

> Hello Mitsumasa,
>
> Could you please share your ceph.conf and Ceph logs? We managed to make it
> work across the public network, except for the Cinder service, which we
> are still working to resolve.
>
> Best Regards,
> Serkan
>
>
>
> On Fri, Dec 9, 2022 at 12:36, Mitsumasa KONDO <kondo.mitsumasa@xxxxxxxxx>
> wrote:
>
>> Hi,
>>
>> I am trying to enable RDMA in my Ceph cluster, but after setting the
>> config it gets stuck...
>>
>> # ceph --version
>> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
>> (stable)
>> # ceph config set global ms_type async+rdma
>> # ceph -s
>> 2022-12-09T17:53:04.954+0900 7f85b55b7700 -1 Infiniband verify_prereq!!!
>> WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big
>> enough to allow large amount of registered memory. We recommend setting
>> this parameter to infinity
>> [[[stuck]]]
>>
>> I confirmed RDMA works on my network with rping, and I set memlock to
>> unlimited in /etc/security/limits.conf. What should I do?
>>
>> Regards,
>> --
>> Mitsumasa KONDO
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



