Dear All,

In the last few days we've been facing a strange problem with RBD mapping in our 5-host cluster. The cluster has been running for 12 months and was updated from Quincy to Reef two weeks ago with no problems.

On Saturday we decided to shut down one of the 5 nodes to insert a test NVMe drive; our failure domain is set at host level. Before this operation the OSD flags were set (nodown, noout, norebalance, etc.). With one node down the cluster continued working correctly, with one exception: many RBD images mapped on various clients stopped working. This happened across various types of clients, so both our external Proxmox cluster and our Windows machines lost these mapped RBD devices. After bringing node 5 back into the cluster, the problem is still present.

To make things even stranger, the cluster is in HEALTH_OK state and there are no apparent issues on the OSDs. We then noticed that not all RBD images were lost, only those created in pools placed on HDD class devices. A number of images in SSD class pools are not affected by the problem, so we temporarily moved (cloned) the most important images to SSD pools to get them back up and working.

Now I'm seeking the community's help on how to investigate the problem. We created a new HDD class pool with a new image for test purposes, but it can't be mapped. In a few tests, mapping on Windows succeeded, but then the device was immediately removed because of a stale connection and lack of communication with the device. Testing from a cluster node with "rbd bench --io-type write test_hdd --pool=testhdd" works perfectly, so the OSDs and the cluster seem to be fine... we are quite lost at this.

Any suggestion on what to check?

Thanks in advance for your help!

Regards,
Daniele
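P.S. For reference, this is roughly how the test pool and image were set up; it's a sketch rather than the exact commands (the CRUSH rule name "replicated_hdd", the PG counts and the 10G image size are illustrative), while the pool and image names match the bench command above:

    # CRUSH rule restricted to HDD-class OSDs, failure domain = host
    ceph osd crush rule create-replicated replicated_hdd default host hdd

    # Replicated pool on that rule, initialized for RBD
    ceph osd pool create testhdd 64 64 replicated replicated_hdd
    rbd pool init testhdd

    # Test image
    rbd create testhdd/test_hdd --size 10G

    # Server-side write benchmark -- this always completes fine
    rbd bench --io-type write test_hdd --pool=testhdd

    # Client-side mapping attempt -- this is what fails or gets removed shortly after
    rbd map testhdd/test_hdd
    dmesg | tail -n 50   # kernel client messages after the map attempt

The bench from a cluster node always completes, while mapping the same image from clients either fails outright or the mapped device disappears shortly afterwards.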