Hello Jan,
yes, try redeploying one of the faulty OSDs and check whether it dies
again.
So empty it, clean it up and redeploy it.
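For example, assuming the flapping OSD is osd.1 and your OSDs are managed
by an orchestrator spec (only a rough sketch, adjust the ID and the spec
file name to your setup):
# ceph osd out 1
# ceph orch osd rm 1 --zap
# ceph orch osd rm status          (watch the draining/removal progress)
# ceph orch apply -i osd-spec.yaml (only if cephadm does not recreate the OSD on the freed device by itself)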
Best,
Malte
On 10.12.24 10:29, Jan Marek wrote:
Hello,
we sometimes have a problem with an OSD process getting into a "weird"
state - it is "flapping" between dead and healthy.
The Ceph cluster is version 18.2.2, installed via the 'cephadm'
bootstrap process.
The cluster hosts use RoCE (RDMA) for internal communication:
# ceph config dump:
global advanced ms_cluster_type async+rdma *
global advanced ms_public_type async+posix *
Our network cards are:
43:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
43:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
(2x100Gbps)
We have all-flash NVMe configuration.
When an OSD goes into this "half-dead" state, we see these messages
in our syslog:
2024-12-10T00:56:55.025681+01:00 c1-dc1 ceph-osd[3346501]: RDMAStack handle_async_event QP not dead, event associate qp number: 80235 Queue Pair status: IBV_QPS_ERR Event : last WQE reached
2024-12-10T00:56:55.025777+01:00 c1-dc1 ceph-89ec5e54-dba6-11ee-9c3e-72791a95392b-osd-1[3346299]: 2024-12-09T23:56:55.018+0000 7f5b5950c700 -1 RDMAStack handle_async_event QP not dead, event associate qp number: 80235 Queue Pair status: IBV_QPS_ERR Event : last WQE reached
We are using Podman as the container runtime.
When we restart the problematic container, it starts up and
everything runs smoothly again. If we don't restart it, the log
fills with messages saying that other OSDs have marked this OSD
as dead, while the OSD itself disputes this, and the cluster ends
up in a problematic state. We use this Ceph cluster as storage
for Proxmox virtualization, and some VMs don't "survive" this
situation because their "disks" become inaccessible :-(.
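(For illustration, with cephadm such a restart would typically look
something like this - the fsid and OSD id are taken from the log line
above, so adjust them as needed:
# systemctl restart ceph-89ec5e54-dba6-11ee-9c3e-72791a95392b@osd.1.service
or, via the orchestrator, 'ceph orch daemon restart osd.1'.)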
Is there some solution we can try?
Many thanks for any advice.
Sincerely
Jan Marek
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx