Issues with Ceph Cluster Behavior After Migration from ceph-ansible to cephadm and Upgrade to Quincy/Reef

Hi all,

About five months ago I migrated my Ceph cluster from *ceph-ansible* to
*cephadm* and upgraded from *Pacific 16.2.11* to the latest *Quincy* release
at the time, followed by an upgrade to *Reef 18.2.4* two months later, since
we had been running an unsupported version of Ceph. Since this migration and
upgrade, I've noticed unexpected behavior in the cluster, particularly
related to OSD state awareness and balancer efficiency.

*1. OSD Nearfull Not Reported Until Restart*

I had an OSD exceed its configured nearfull threshold, but *Ceph did not
detect or report it* via ceph status. As a result, the cluster entered a
degraded state without any warning. Only after manually restarting the
affected OSD did Ceph recognize the nearfull state and update the
corresponding pools accordingly. This did not happen on *Pacific/ceph-ansible*,
where Ceph detected and acted on the nearfull condition without requiring a
restart. It has happened repeatedly since the migration/upgrade.
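
For context, this is roughly what I compare when it happens (a minimal
sketch; the ratios mentioned are the Ceph defaults, not values specific to
our cluster):

    # Thresholds the monitors are enforcing
    # (defaults: nearfull 0.85, backfillfull 0.90, full 0.95)
    ceph osd dump | grep -i ratio

    # Per-OSD utilization; an OSD above nearfull_ratio should raise
    # an OSD_NEARFULL warning in the health output
    ceph osd df
    ceph health detail | grep -i nearfull

In our case the health output stays clean even though the OSD is clearly
above the threshold, until that OSD is restarted.
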
*2. injectargs Not Taking Effect Until OSD Restart*

I've also observed that ceph tell osd.X injectargs '--<option>=<value>' often
has no effect. The OSD does not seem to apply the new value until it is
*manually restarted*, at which point injectargs works as expected again.
However, after a few hours or days the issue reappears, requiring another
restart before runtime settings can be modified.
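
Concretely, the pattern looks something like this (osd.3 and debug_osd are
just placeholder examples, not the specific option that misbehaves):

    # Inject a runtime change into one OSD
    ceph tell osd.3 injectargs '--debug_osd 10/10'

    # Ask the daemon itself what it is currently running with
    ceph tell osd.3 config get debug_osd

    # Compare with what the mon/mgr report for that daemon
    ceph config show osd.3 debug_osd

When the problem is present, the value the daemon reports does not change
after the injectargs call; after a restart, the same call takes effect
immediately.
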
*3. Ceph Balancer and PG Remapping Issues*

The Ceph balancer appears to be operating, but its behavior seems
inefficient compared to what we experienced on *Pacific*. It often fails to
optimize data distribution effectively, and I have to rely on the
*pgremapper* tool to manually intervene. Restarting OSDs seems to improve
the balancer’s effectiveness temporarily, suggesting that stale OSD state
information may be contributing to the issue.
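
For what it's worth, this is roughly what I look at when judging the
balancer (assuming upmap mode; a lower eval score means a more even
distribution):

    # Balancer mode, whether it is active, and any plan in progress
    ceph balancer status

    # Score of the current PG distribution (lower is better)
    ceph balancer eval

    # Per-OSD utilization spread the balancer should be tightening
    ceph osd df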

Since this is a *high-performance computing (HPC) environment*, manually
restarting OSDs on a regular basis is not a viable solution. These issues
did not occur when we were running *Pacific with ceph-ansible*, and I’m
wondering if others have experienced similar problems after migrating to
*cephadm* and/or upgrading to *Quincy/Reef*.

I noticed people on Reddit with the same issue, but their "resolution" was:
"I switch off my whole ceph cluster & switch it back on - to get it working
100% again - DAILY."

Has anyone else encountered these behaviors? Are there any known bugs or
workarounds that could help restore expected OSD state tracking and
balancer efficiency?

Any insights would be greatly appreciated!

Thanks,


-- 



*Jeremi-Ernst Avenant, Mr.*
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy,
University of Cape Town

Tel: 021 959 4137
Web: www.idia.ac.za | www.ilifu.ac.za
E-mail (IDIA): jeremi@xxxxxxxxxx
Rondebosch, Cape Town, 7600, South Africa



