Hello,

we are configuring a new Ceph cluster with Mellanox 2x100Gbps cards. We bonded the two ports into an MLAG bond0 interface. In async+posix mode everything is OK and the cluster is in the HEALTH_OK state. The Ceph version is 18.2.1.

We then tried to configure RoCE for the cluster part of the network, but without success.

Our ceph config dump (only the relevant config):

global                  advanced  ms_async_rdma_device_name  mlx5_bond_0                              *
global                  advanced  ms_async_rdma_gid_idx      3
global host:ceph1-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d8  *
global host:ceph2-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d7  *
global host:ceph3-nvme  advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d6  *
global                  advanced  ms_async_rdma_roce_ver     2
global                  advanced  ms_async_rdma_type         rdma                                     *
global                  advanced  ms_cluster_type            async+rdma                               *
global                  advanced  ms_public_type             async+posix                              *

On ceph1-nvme, show_gids.sh reports this list:

# ./show_gids.sh
DEV          PORT  INDEX  GID                                      IPv4           VER  DEV
---          ----  -----  ---                                      ------------   ---  ---
mlx5_bond_0  1     0      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                 v1   bond0
mlx5_bond_0  1     1      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                 v2   bond0
mlx5_bond_0  1     2      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216  v1   bond0
mlx5_bond_0  1     3      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216  v2   bond0
n_gids_found=4

I have also set this line in /etc/security/limits.conf:

* hard memlock unlimited

But when I restart ceph.target, the OSDs do not start and fail with the errors shown in the attached log (included below). The Mellanox drivers are the in-kernel ones from Debian bookworm.

Is something missing in the config, or is there an error somewhere? When I change ms_cluster_type back to async+posix and restart ceph.target, the cluster converges to the HEALTH_OK state again...

Thanks for any advice...

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
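P.S. For reference, a minimal sketch of how the options from the config dump above can be applied with the ceph CLI, assuming the standard `ceph config set <who>[/<mask>] <option> <value>` form; the verification commands at the end are purely illustrative additions, not something taken from my actual setup:

# cluster network over RoCE v2, public network stays on plain TCP
ceph config set global ms_cluster_type           async+rdma
ceph config set global ms_public_type            async+posix
ceph config set global ms_async_rdma_type        rdma
ceph config set global ms_async_rdma_roce_ver    2
ceph config set global ms_async_rdma_device_name mlx5_bond_0
ceph config set global ms_async_rdma_gid_idx     3

# per-host RoCE v2 GID (index 3 in the show_gids.sh output on each node)
ceph config set global/host:ceph1-nvme ms_async_rdma_local_gid 0000:0000:0000:0000:0000:ffff:a0d9:05d8
ceph config set global/host:ceph2-nvme ms_async_rdma_local_gid 0000:0000:0000:0000:0000:ffff:a0d9:05d7
ceph config set global/host:ceph3-nvme ms_async_rdma_local_gid 0000:0000:0000:0000:0000:ffff:a0d9:05d6

# sanity checks before restarting ceph.target (illustrative only)
ibv_devices                   # should list mlx5_bond_0
ibv_devinfo -d mlx5_bond_0    # port state and link layer
ulimit -l                     # memlock limit of the current shell

Note that in a containerized (cephadm/podman) deployment like this one, the memlock limit and the RDMA device visibility that matter are the ones seen inside the OSD container, not necessarily those of the host shell.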
2024-02-05T09:56:50.249344+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
2024-02-05T09:56:50.249362+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
2024-02-05T09:56:50.249377+01:00 ceph3-nvme ceph-osd[21139]: asok(0x559592978000) register_command rotate-key hook 0x7fffcc26d398
2024-02-05T09:56:50.249391+01:00 ceph3-nvme ceph-osd[21139]: log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
2024-02-05T09:56:50.249409+01:00 ceph3-nvme ceph-osd[21139]: osd.2 109 log_to_monitors true
2024-02-05T09:56:50.249424+01:00 ceph3-nvme ceph-osd[21139]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::init()' thread 7fb3bb042700 time 2024-02-05T08:56:50.142198+0000#012/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: 1061: FAILED ceph_assert(device)#012#012 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55958fb905b3]#012 2: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]#012 3: (Infiniband::init()+0x95b) [0x55959077df0b]#012 4: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]#012 5: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]#012 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]#012 7: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 9: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 10: clone()
2024-02-05T09:56:50.249447+01:00 ceph3-nvme ceph-osd[21139]: *** Caught signal (Aborted) **#012 in thread 7fb3bb042700 thread_name:msgr-worker-0#012#012 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: /lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]#012 2: gsignal()#012 3: abort()#012 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x55958fb9060d]#012 5: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]#012 6: (Infiniband::init()+0x95b) [0x55959077df0b]#012 7: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]#012 8: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]#012 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]#012 10: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 11: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 12: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 13: clone()#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-02-05T09:56:50.249475+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2717> 2024-02-05T08:56:50.132+0000 7fb3c49da640 -1 osd.2 109 log_to_monitors true
2024-02-05T09:56:50.249503+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2716> 2024-02-05T08:56:50.140+0000 7fb3bb042700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::init()' thread 7fb3bb042700 time 2024-02-05T08:56:50.142198+0000
2024-02-05T09:56:50.249530+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: 1061: FAILED ceph_assert(device)
2024-02-05T09:56:50.249551+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
2024-02-05T09:56:50.249571+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
2024-02-05T09:56:50.249593+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55958fb905b3]
2024-02-05T09:56:50.249614+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 2: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
2024-02-05T09:56:50.249634+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 3: (Infiniband::init()+0x95b) [0x55959077df0b]
2024-02-05T09:56:50.249656+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 4: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]
2024-02-05T09:56:50.249678+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 5: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
2024-02-05T09:56:50.249699+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
2024-02-05T09:56:50.249721+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 7: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
2024-02-05T09:56:50.249741+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
2024-02-05T09:56:50.249761+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 9: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
2024-02-05T09:56:50.249784+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 10: clone()
2024-02-05T09:56:50.249804+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
2024-02-05T09:56:50.249824+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2715> 2024-02-05T08:56:50.168+0000 7fb3bb042700 -1 *** Caught signal (Aborted) **
2024-02-05T09:56:50.249843+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: in thread 7fb3bb042700 thread_name:msgr-worker-0
2024-02-05T09:56:50.249862+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
2024-02-05T09:56:50.249881+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
2024-02-05T09:56:50.249901+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 1: /lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]
2024-02-05T09:56:50.249920+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 2: gsignal()
2024-02-05T09:56:50.249940+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 3: abort()
2024-02-05T09:56:50.249959+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x55958fb9060d]
2024-02-05T09:56:50.249982+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 5: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
2024-02-05T09:56:50.250002+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 6: (Infiniband::init()+0x95b) [0x55959077df0b]
2024-02-05T09:56:50.250021+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 7: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]
2024-02-05T09:56:50.250042+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 8: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
2024-02-05T09:56:50.250066+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
2024-02-05T09:56:50.250090+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 10: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
2024-02-05T09:56:50.250110+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 11: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
2024-02-05T09:56:50.250132+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 12: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
2024-02-05T09:56:50.250154+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 13: clone()
2024-02-05T09:56:50.250174+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2024-02-05T09:56:50.250196+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
2024-02-05T09:56:50.332927+01:00 ceph3-nvme podman[21994]: 2024-02-05 09:56:50.332706257 +0100 CET m=+0.015245010 container died 7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 (image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f, name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, org.label-schema.vendor=CentOS, CEPH_POINT_RELEASE=-18.2.1, GIT_CLEAN=True, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, org.label-schema.name=CentOS Stream 8 Base Image, GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, org.label-schema.schema-version=1.0, org.label-schema.build-date=20240102, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, RELEASE=HEAD, io.buildah.version=1.29.1, GIT_REPO=https://github.com/ceph/ceph-container.git, ceph=True)
2024-02-05T09:56:50.337613+01:00 ceph3-nvme systemd[1]: var-lib-containers-storage-overlay-3df2065cca86c325ca5c08e2c4dd88773c177351b5ff65b21e53ace6e5b2fecb-merged.mount: Deactivated successfully.
2024-02-05T09:56:50.341278+01:00 ceph3-nvme podman[21994]: 2024-02-05 09:56:50.341208686 +0100 CET m=+0.023747429 container remove 7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 (image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f, name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, ceph=True, org.label-schema.schema-version=1.0, io.buildah.version=1.29.1, GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_BRANCH=HEAD, GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, org.label-schema.build-date=20240102, org.label-schema.name=CentOS Stream 8 Base Image, GIT_CLEAN=True, RELEASE=HEAD, CEPH_POINT_RELEASE=-18.2.1)
2024-02-05T09:56:50.343181+01:00 ceph3-nvme systemd[1]: ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service: Main process exited, code=exited, status=139/n/a