Hello,

we've found the problem: in the systemd unit for the OSD, the following line is missing from the [Service] section:

LimitMEMLOCK=infinity

This also explains why the "* hard memlock unlimited" line in /etc/security/limits.conf didn't help: systemd services don't go through PAM, so limits.conf is not applied to them, and the limit has to be set in the unit itself.

When I added this line to the systemd unit, the OSD daemon started and we have the HEALTH_OK state in the cluster status.
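For illustration, a minimal sketch of the change as a systemd drop-in override. The unit name used here is the cephadm-style one (ceph-<fsid>@osd.<id>.service) visible in the logs below; substitute your own cluster fsid and OSD id:

  # systemctl edit ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service

and put into the override (it ends up in /etc/systemd/system/ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service.d/override.conf):

  [Service]
  LimitMEMLOCK=infinity

then:

  # systemctl daemon-reload
  # systemctl restart ceph.target

The effective limit can be checked afterwards with "systemctl show ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service -p LimitMEMLOCK".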
Sincerely
Jan Marek

On Mon, Feb 05, 2024 at 11:10:21 CET, Jan Marek wrote:
> Hello,
>
> we are configuring a new Ceph cluster with Mellanox 2x100Gbps cards.
>
> We bonded these two ports into an MLAG bond0 interface.
>
> In async+posix mode everything is OK and the cluster is in the
> HEALTH_OK state.
>
> The Ceph version is 18.2.1.
>
> Then we tried to configure RoCE for the cluster part of the network,
> but without success.
>
> Our ceph config dump (only the relevant config):
>
> global                   advanced  ms_async_rdma_device_name  mlx5_bond_0                              *
> global                   advanced  ms_async_rdma_gid_idx      3
> global host:ceph1-nvme   advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d8  *
> global host:ceph2-nvme   advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d7  *
> global host:ceph3-nvme   advanced  ms_async_rdma_local_gid    0000:0000:0000:0000:0000:ffff:a0d9:05d6  *
> global                   advanced  ms_async_rdma_roce_ver     2
> global                   advanced  ms_async_rdma_type         rdma                                     *
> global                   advanced  ms_cluster_type            async+rdma                               *
> global                   advanced  ms_public_type             async+posix                              *
>
> On ceph1-nvme, show_gids.sh reports:
>
> # ./show_gids.sh
> DEV          PORT  INDEX  GID                                      IPv4           VER  DEV
> ---          ----  -----  ---                                      ------------   ---  ---
> mlx5_bond_0  1     0      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                 v1   bond0
> mlx5_bond_0  1     1      fe80:0000:0000:0000:0e42:a1ff:fe93:b004                 v2   bond0
> mlx5_bond_0  1     2      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216  v1   bond0
> mlx5_bond_0  1     3      0000:0000:0000:0000:0000:ffff:a0d9:05d8  160.217.5.216  v2   bond0
> n_gids_found=4
>
> I have set this line in /etc/security/limits.conf:
>
> * hard memlock unlimited
>
> But when I tried to restart ceph.target, the OSD daemons didn't start;
> the errors are in the attachment below.
>
> The Mellanox drivers are the ones from the Debian bookworm kernel.
>
> Is something missing in the config, or is there some error?
>
> When I change ms_cluster_type back to async+posix and restart
> ceph.target, the cluster converges to the HEALTH_OK state...
>
> Thanks for any advice...
>
> Sincerely
> Jan Marek
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html
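For reference, a sketch of how such a configuration can be applied with the ceph CLI, assuming the usual who/mask syntax of "ceph config set" (the values are the ones from the dump above; repeat the local-gid command with the respective GID for each host):

  # ceph config set global ms_async_rdma_device_name mlx5_bond_0
  # ceph config set global ms_async_rdma_gid_idx 3
  # ceph config set global/host:ceph1-nvme ms_async_rdma_local_gid 0000:0000:0000:0000:0000:ffff:a0d9:05d8
  # ceph config set global ms_async_rdma_roce_ver 2
  # ceph config set global ms_async_rdma_type rdma
  # ceph config set global ms_cluster_type async+rdma
  # ceph config set global ms_public_type async+posix

Note that ms_async_rdma_gid_idx 3 matches the RoCE v2 entry for the bond's IPv4 address in the show_gids.sh output above.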
> 2024-02-05T09:56:50.249344+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
> 2024-02-05T09:56:50.249362+01:00 ceph3-nvme ceph-osd[21139]: auth: KeyRing::load: loaded key file /var/lib/ceph/osd/ceph-2/keyring
> 2024-02-05T09:56:50.249377+01:00 ceph3-nvme ceph-osd[21139]: asok(0x559592978000) register_command rotate-key hook 0x7fffcc26d398
> 2024-02-05T09:56:50.249391+01:00 ceph3-nvme ceph-osd[21139]: log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
> 2024-02-05T09:56:50.249409+01:00 ceph3-nvme ceph-osd[21139]: osd.2 109 log_to_monitors true
> 2024-02-05T09:56:50.249424+01:00 ceph3-nvme ceph-osd[21139]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::init()' thread 7fb3bb042700 time 2024-02-05T08:56:50.142198+0000#012/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: 1061: FAILED ceph_assert(device)#012#012 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55958fb905b3]#012 2: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]#012 3: (Infiniband::init()+0x95b) [0x55959077df0b]#012 4: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]#012 5: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]#012 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]#012 7: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 9: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 10: clone()
> 2024-02-05T09:56:50.249447+01:00 ceph3-nvme ceph-osd[21139]: *** Caught signal (Aborted) **#012 in thread 7fb3bb042700 thread_name:msgr-worker-0#012#012 ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)#012 1: /lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]#012 2: gsignal()#012 3: abort()#012 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x55958fb9060d]#012 5: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]#012 6: (Infiniband::init()+0x95b) [0x55959077df0b]#012 7: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]#012 8: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]#012 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]#012 10: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]#012 11: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]#012 12: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]#012 13: clone()#012 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2024-02-05T09:56:50.249475+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2717> 2024-02-05T08:56:50.132+0000 7fb3c49da640 -1 osd.2 109 log_to_monitors true
> 2024-02-05T09:56:50.249503+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2716> 2024-02-05T08:56:50.140+0000 7fb3bb042700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: In function 'void Infiniband::init()' thread 7fb3bb042700 time 2024-02-05T08:56:50.142198+0000
> 2024-02-05T09:56:50.249530+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.1/rpm/el8/BUILD/ceph-18.2.1/src/msg/async/rdma/Infiniband.cc: 1061: FAILED ceph_assert(device)
> 2024-02-05T09:56:50.249551+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
> 2024-02-05T09:56:50.249571+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> 2024-02-05T09:56:50.249593+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x55958fb905b3]
> 2024-02-05T09:56:50.249614+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 2: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
> 2024-02-05T09:56:50.249634+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 3: (Infiniband::init()+0x95b) [0x55959077df0b]
> 2024-02-05T09:56:50.249656+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 4: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]
> 2024-02-05T09:56:50.249678+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 5: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
> 2024-02-05T09:56:50.249699+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 6: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
> 2024-02-05T09:56:50.249721+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 7: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
> 2024-02-05T09:56:50.249741+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 8: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
> 2024-02-05T09:56:50.249761+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 9: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
> 2024-02-05T09:56:50.249784+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 10: clone()
> 2024-02-05T09:56:50.249804+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
> 2024-02-05T09:56:50.249824+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: -2715> 2024-02-05T08:56:50.168+0000 7fb3bb042700 -1 *** Caught signal (Aborted) **
> 2024-02-05T09:56:50.249843+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: in thread 7fb3bb042700 thread_name:msgr-worker-0
> 2024-02-05T09:56:50.249862+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
> 2024-02-05T09:56:50.249881+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
> 2024-02-05T09:56:50.249901+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 1: /lib64/libpthread.so.0(+0x12d20) [0x7fb3c2498d20]
> 2024-02-05T09:56:50.249920+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 2: gsignal()
> 2024-02-05T09:56:50.249940+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 3: abort()
> 2024-02-05T09:56:50.249959+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x55958fb9060d]
> 2024-02-05T09:56:50.249982+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 5: /usr/bin/ceph-osd(+0x62d779) [0x55958fb90779]
> 2024-02-05T09:56:50.250002+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 6: (Infiniband::init()+0x95b) [0x55959077df0b]
> 2024-02-05T09:56:50.250021+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 7: (RDMAWorker::listen(entity_addr_t&, unsigned int, SocketOptions const&, ServerSocket*)+0x30) [0x55959053bba0]
> 2024-02-05T09:56:50.250042+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 8: /usr/bin/ceph-osd(+0xfbb03f) [0x55959051e03f]
> 2024-02-05T09:56:50.250066+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa64) [0x55959052f4c4]
> 2024-02-05T09:56:50.250090+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 10: /usr/bin/ceph-osd(+0xfd12c6) [0x5595905342c6]
> 2024-02-05T09:56:50.250110+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 11: /lib64/libstdc++.so.6(+0xc2b23) [0x7fb3c1adcb23]
> 2024-02-05T09:56:50.250132+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 12: /lib64/libpthread.so.0(+0x81ca) [0x7fb3c248e1ca]
> 2024-02-05T09:56:50.250154+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: 13: clone()
> 2024-02-05T09:56:50.250174+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2024-02-05T09:56:50.250196+01:00 ceph3-nvme ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2[21135]:
> 2024-02-05T09:56:50.332927+01:00 ceph3-nvme podman[21994]: 2024-02-05 09:56:50.332706257 +0100 CET m=+0.015245010 container died 7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 (image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f, name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, org.label-schema.vendor=CentOS, CEPH_POINT_RELEASE=-18.2.1, GIT_CLEAN=True, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, org.label-schema.name=CentOS Stream 8 Base Image, GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, org.label-schema.schema-version=1.0, org.label-schema.build-date=20240102, org.label-schema.license=GPLv2, GIT_BRANCH=HEAD, RELEASE=HEAD, io.buildah.version=1.29.1, GIT_REPO=https://github.com/ceph/ceph-container.git, ceph=True)
> 2024-02-05T09:56:50.337613+01:00 ceph3-nvme systemd[1]: var-lib-containers-storage-overlay-3df2065cca86c325ca5c08e2c4dd88773c177351b5ff65b21e53ace6e5b2fecb-merged.mount: Deactivated successfully.
> 2024-02-05T09:56:50.341278+01:00 ceph3-nvme podman[21994]: 2024-02-05 09:56:50.341208686 +0100 CET m=+0.023747429 container remove 7b35d0bcf4e2c6af121e5973c4fec96940f5626f3eb7530d560d83bc14a7fea4 (image=quay.io/ceph/ceph@sha256:a4e86c750cc11a8c93453ef5682acfa543e3ca08410efefa30f520b54f41831f, name=ceph-87483e28-c19a-11ee-90ed-0c42a193b004-osd-2, ceph=True, org.label-schema.schema-version=1.0, io.buildah.version=1.29.1, GIT_REPO=https://github.com/ceph/ceph-container.git, GIT_BRANCH=HEAD, GIT_COMMIT=4e397bbc7ff93e76025ef390087dfcea05ef676e, maintainer=Guillaume Abrioux <gabrioux@xxxxxxxxxx>, org.label-schema.license=GPLv2, org.label-schema.vendor=CentOS, org.label-schema.build-date=20240102, org.label-schema.name=CentOS Stream 8 Base Image, GIT_CLEAN=True, RELEASE=HEAD, CEPH_POINT_RELEASE=-18.2.1)
> 2024-02-05T09:56:50.343181+01:00 ceph3-nvme systemd[1]: ceph-87483e28-c19a-11ee-90ed-0c42a193b004@osd.2.service: Main process exited, code=exited, status=139/n/a
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx