On 4/22/22 09:25, Benoît Knecht wrote:
Hi,

We use the following procedure to remove an OSD from a Ceph cluster (to replace a defective disk, for instance):

# ceph osd crush reweight 559 0
(Wait for the cluster to rebalance.)
# ceph osd out 559
# ceph osd ok-to-stop 559
# ceph osd safe-to-destroy 559
(Stop the OSD daemon.)
# ceph osd purge 559

This works great when there's no rebalancing happening on the cluster, but if there is, the last step fails:

# ceph osd purge 559
Error EAGAIN: OSD(s) 559 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. You can proceed by passing --force, but be warned that this will likely mean real, permanent data loss.

But none of the PGs are degraded, so it isn't clear to me why Ceph thinks this is a risky operation. The only PGs that are not active+clean are active+remapped+backfill_wait or active+remapped+backfilling.

Is the ceph osd purge command overly cautious, or am I overlooking an edge case that could lead to data loss? I know I could use --force, but I don't want to override these safety checks if they're legitimate.
To me this looks like Ceph being overly cautious: the check appears to only accept PGs in the active+clean state, regardless of whether the non-clean PGs actually involve the OSD you want to remove. As long as you have not set "norebalance", "norecover" or "nobackfill", an OSD that has been reweighted to 0 and marked out should have no PGs mapped to it once the rebalance has finished.
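If you want to double-check that before going further, something along these lines should do (a rough sketch; 559 is the id from your example, and the pgs_brief column layout may vary slightly between releases):

# ceph pg ls-by-osd 559
(An empty list means the OSD no longer holds any data.)
# ceph pg dump pgs_brief 2>/dev/null | awk 'NR > 1 {print $2}' | sort | uniq -c
(Summarises PG states cluster-wide; you would expect only active+clean plus the remapped/backfill states you mention.)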
Instead of purge you can do "ceph osd rm $id", "ceph auth rm osd.$id" and "ceph osd crush rm osd.$id" ... but that's probably the same as using "--force" with the purge command, since purge is essentially those three steps behind a safety check.
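For the id in this thread that would be, in order (again a sketch; run it only once the OSD daemon is stopped and ceph pg ls-by-osd 559 comes back empty):

# ceph osd rm 559
# ceph auth rm osd.559
# ceph osd crush rm osd.559
(ceph auth del osd.559 is an equivalent older spelling of the second command.)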
Gr. Stefan