Ceph OSD purge doesn't work while rebalancing

Hi,

We use the following procedure to remove an OSD from a Ceph cluster (to replace
a defective disk for instance):

  # ceph osd crush reweight 559 0
  (Wait for the cluster to rebalance.)
  # ceph osd out 559
  # ceph osd ok-to-stop 559
  # ceph osd safe-to-destroy 559
  (Stop the OSD daemon.)
  # ceph osd purge 559
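
For reference, the steps above can be wrapped in a rough script that polls the exit status of ceph osd safe-to-destroy instead of eyeballing the rebalance (purge_osd is just my own function name and the 60-second interval is arbitrary; treat this as a sketch, not a tested tool):

```shell
#!/bin/sh
# Sketch: drain an OSD and purge it once the cluster says it is safe.
purge_osd() {
    osd_id="$1"

    # Drain the OSD, then mark it out. Note crush reweight takes the
    # CRUSH item name (osd.N), not the bare numeric id.
    ceph osd crush reweight "osd.${osd_id}" 0
    ceph osd out "${osd_id}"

    # safe-to-destroy exits non-zero until every PG copy held by this
    # OSD is recoverable elsewhere, so poll it rather than parsing
    # PG states by hand.
    until ceph osd safe-to-destroy "${osd_id}"; do
        sleep 60
    done

    # Stop the daemon on the host before this point, e.g.:
    #   systemctl stop ceph-osd@${osd_id}
    ceph osd purge "${osd_id}" --yes-i-really-mean-it
}

# Example: purge_osd 559
```

The daemon stop still has to happen on the OSD's host between the safe-to-destroy check and the purge; the script only automates the waiting.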

This works great when there's no rebalancing happening on the cluster, but if
there is, the last step (ceph osd purge 559) fails with

  # ceph osd purge 559
  Error EAGAIN: OSD(s) 559 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
  You can proceed by passing --force, but be warned that this will likely mean real, permanent data loss.

None of the PGs are degraded, though, so it isn't clear to me why Ceph considers
this a risky operation. The only PGs that are not active+clean are in
active+remapped+backfill_wait or active+remapped+backfilling.
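
For what it's worth, something like this is what I use to list the PGs that are not active+clean (the helper name is mine, and the pgs_brief column layout may differ between Ceph releases):

```shell
# Sketch: print the id and state of every PG that is not active+clean.
# Assumes "ceph pg dump pgs_brief" prints a header line followed by
# "PGID STATE ..." columns.
list_unclean_pgs() {
    ceph pg dump pgs_brief 2>/dev/null \
        | awk 'NR > 1 && $2 != "active+clean" { print $1, $2 }'
}
```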

Is the ceph osd purge command being overly cautious here, or am I overlooking an
edge case that could lead to data loss? I know I could pass --force, but I don't
want to override these safety checks if they're legitimate.

Cheers,

--
Ben

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


