Hi all,

I recently migrated my Ceph cluster from *ceph-ansible* to *cephadm* (about five months ago) and upgraded from *Pacific 16.2.11* to *Quincy* (the latest release at the time), followed by an upgrade to *Reef 18.2.4* two months later, since we had been running an unsupported version of Ceph. Since this migration and upgrade, I've noticed unexpected behavior in the cluster, particularly around OSD state awareness and balancer efficiency.

*1. OSD Nearfull Not Reported Until Restart*

I had an OSD exceed its configured nearfull threshold, but *Ceph did not detect or report it* via ceph status. As a result, the cluster entered a degraded state without any warning. Only after I manually restarted the affected OSD did Ceph recognize the nearfull state and update the corresponding pools accordingly. This did not happen on *Pacific/ceph-ansible*: the nearfull condition was detected and acted on without a restart. It has been a recurring problem since the migration/upgrade.
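For anyone comparing notes, this is roughly how I cross-check what the monitors report against the configured thresholds (standard commands, nothing cephadm-specific):

    # Health codes - a nearfull OSD should surface as OSD_NEARFULL / POOL_NEARFULL
    ceph health detail

    # Per-OSD utilization as the mgr sees it
    ceph osd df tree

    # The configured nearfull/backfillfull/full ratios the warnings should trigger on
    ceph osd dump | grep -i ratio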
*2. injectargs Not Taking Effect Until OSD Restart*

I've also observed that ceph tell osd.X injectargs ... often has no effect. The OSD does not apply the new arguments until it is *manually restarted*, at which point injectargs works as expected. However, after a few hours or days the problem returns, and another restart is needed before runtime settings can be changed again.
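In case it is useful, this is how I compare what the cluster thinks is set against what the daemon is actually running with (osd.12 and osd_max_scrubs here are just arbitrary examples; the centralized config store is the recommended path on cephadm deployments anyway):

    # What the mon config database says
    ceph config get osd.12 osd_max_scrubs

    # What the running daemon actually has
    ceph tell osd.12 config get osd_max_scrubs

    # Setting via the config store instead of injectargs
    ceph config set osd.12 osd_max_scrubs 2

If the two values diverge, that at least narrows the problem down to the daemon not applying changes rather than the mons not recording them.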
*3. Ceph Balancer and PG Remapping Issues*

The Ceph balancer appears to be operating, but it is noticeably less effective than what we experienced on *Pacific*. It often fails to optimize the data distribution, and I have to rely on the *pgremapper* tool to intervene manually. Restarting OSDs temporarily improves the balancer's effectiveness, which again suggests that stale OSD state information may be contributing to the issue.
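For completeness, this is what I look at when judging the balancer ("myplan" is just an arbitrary plan name):

    ceph balancer status
    ceph balancer eval                # score of the current distribution (lower is better)
    ceph balancer optimize myplan     # ask it to generate a plan
    ceph balancer show myplan         # inspect what it would actually move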
Since this is a *high-performance computing (HPC)* environment, manually restarting OSDs on a regular basis is not a viable solution.

These issues did not occur when we were running *Pacific* with *ceph-ansible*, and I'm wondering if others have experienced similar problems after migrating to *cephadm* and/or upgrading to *Quincy/Reef*. I did notice people on Reddit with the same issue, but their "resolution" was: "I switch off my whole ceph cluster & switch it back on - to get it working 100% again - DAILY."

Has anyone else encountered these behaviors? Are there any known bugs or workarounds that could help restore expected OSD state tracking and balancer efficiency? Any insights would be greatly appreciated!

Thanks,

--
*Jeremi-Ernst Avenant, Mr.*
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy, University of Cape Town
Tel: 021 959 4137
Web: www.idia.ac.za | www.ilifu.ac.za
E-mail (IDIA): jeremi@xxxxxxxxxx
Rondebosch, Cape Town, 7600, South Africa

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx