Hello,

we sometimes have a problem with an OSD process getting into a "weird" state - it keeps flapping between down and healthy. The Ceph cluster is version 18.2.2, installed via the 'cephadm' bootstrap process. The cluster hosts use RoCE (RDMA) for internal communication:

# ceph config dump:
global  advanced  ms_cluster_type  async+rdma   *
global  advanced  ms_public_type   async+posix  *

Our network cards are:

43:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
43:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

(2x100Gbps). We have an all-flash NVMe configuration.

When an OSD goes into this "half-dead" state, we see these messages in syslog:

2024-12-10T00:56:55.025681+01:00 c1-dc1 ceph-osd[3346501]: RDMAStack handle_async_event QP not dead, event associate qp number: 80235 Queue Pair status: IBV_QPS_ERR Event : last WQE reached
2024-12-10T00:56:55.025777+01:00 c1-dc1 ceph-89ec5e54-dba6-11ee-9c3e-72791a95392b-osd-1[3346299]: 2024-12-09T23:56:55.018+0000 7f5b5950c700 -1 RDMAStack handle_async_event QP not dead, event associate qp number: 80235 Queue Pair status: IBV_QPS_ERR Event : last WQE reached

We are using podman as the container runtime. When we restart the problematic container, it starts up and everything goes smoothly again (the exact restart commands we use are in the PS below). If we do not restart it, the log fills with messages saying that some OSD has marked this OSD as down, while this OSD complains about it, and the cluster goes into a problematic state.

We use this Ceph cluster as storage for Proxmox virtualization, and some VMs do not "survive" this situation because their "disks" become inaccessible :-(.

Is there a solution we could try?

Many thanks for any advice.

Sincerely
Jan Marek
--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html
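PS: For completeness, the restart we do is nothing special - roughly the following, assuming the standard cephadm systemd unit naming; the fsid and OSD id (osd.1 in this example) are taken from the container name in the log line above, so adjust them to whichever daemon is affected:

# restart the flapping OSD via its cephadm systemd unit on the host:
systemctl restart ceph-89ec5e54-dba6-11ee-9c3e-72791a95392b@osd.1.service

or, from any host with an admin keyring, via the orchestrator:

ceph orch daemon restart osd.1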