Removing failing OSD with cephadm?

I have an OSD that is causing slow ops and appears to be backed by a
failing drive, according to smartctl output.  I am using cephadm.  What is
the best way to remove this drive from the cluster, and what are the
proper steps to replace the disk?
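
I have also seen that cephadm provides an orchestrator-level removal that
drains the OSD and, with `--replace`, keeps the OSD id reserved for the
replacement disk.  A sketch of what I think that would look like (assuming
osd.35, as in the steps below):

`ceph orch osd rm 35 --replace`
`ceph orch osd rm status`

Is that preferable to the manual steps below, or are both fine?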

Mark osd.35 as out.

`sudo ceph osd out osd.35`
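
To confirm it took effect, I assume I can check that the REWEIGHT column
drops to 0 and then watch data movement with:

`ceph osd tree`
`ceph -s`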

Then mark osd.35 as down.

`sudo ceph osd down osd.35`
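
(I assume I can verify the up/down flag afterwards with
`ceph osd dump | grep osd.35`.)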

The OSD stays marked as out, but it does come back up after a couple of
seconds.  I do not know whether that is a problem, or whether I should just
let the drive stay online for as long as it lasts while it is being removed
from the cluster.
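
My understanding (possibly wrong) is that `ceph osd down` only updates the
OSD map, and a running daemon will simply report itself up again, so keeping
it down would mean stopping the daemon itself, e.g. via cephadm:

`ceph orch daemon stop osd.35`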

After the recovery completes, I would then `destroy` the OSD:

`ceph osd destroy {id} --yes-i-really-mean-it`
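
Before that, I assume I should confirm that recovery has finished and the
OSD is safe to remove, e.g.:

`ceph pg stat`
`ceph osd safe-to-destroy osd.35`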

(https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/)

Besides checking the steps above, my question now is: if the drive is acting
very slow and causing slow ops, should I be trying to shut down its OSD
and keep it down?  There is an example of stopping the OSD on the server
using systemctl, outside of cephadm:

ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num}
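
Since these daemons are managed by cephadm, I am guessing the host-level
equivalent would be the fsid-qualified unit (the {fsid} placeholder below is
just illustrative):

`sudo systemctl stop ceph-{fsid}@osd.{osd-num}.service`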


Thanks,
  Matt

-- 
Matt Larson, PhD
Madison, WI  53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


