Thank you Eugen for your warm help! I'm trying to understand the
difference between the two methods.

For method 1, "ceph orch osd rm osd_id", OSD Service — Ceph Documentation
<https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
says it involves two steps:
1. evacuating all placement groups (PGs) from the OSD
2. removing the PG-free OSD from the cluster

For method 2, the procedure you recommended, Adding/Removing OSDs — Ceph
Documentation
<https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
says: "After the OSD has been taken out of the cluster, Ceph begins
rebalancing the cluster by migrating placement groups out of the OSD
that was removed."

What's the difference between "evacuating PGs" in method 1 and
"migrating PGs" in method 2? I think method 1 must read the OSD being
removed; otherwise we would not see the slow ops warning. Does method 2
not involve reading this OSD?

Thanks,
Mary

On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> if you remove the OSD this way, it will be drained, which means that
> Ceph will try to recover the PGs from this OSD, and in case of a
> hardware failure this can lead to slow requests. It might make sense
> to forcefully remove the OSD without draining:
>
> - stop the osd daemon
> - mark it as out
> - osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
>
> Regards,
> Eugen
>
> Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
>
> > Hi,
> >
> > We recently removed an OSD from our Ceph cluster. Its underlying
> > disk has a hardware issue.
> >
> > We used the command: ceph orch osd rm osd_id --zap
> >
> > During the process, the cluster sometimes enters a warning state
> > with slow ops on this OSD. Our RGW also failed to respond to
> > requests and returned 503.
> >
> > We restarted the RGW daemon to make it work again, but the same
> > failure occurred from time to time. Eventually we noticed that the
> > RGW 503 errors were a result of the OSD slow ops.
> >
> > Our cluster has 18 hosts and 210 OSDs. We expect that removing an
> > OSD with a hardware issue won't impact cluster performance and RGW
> > availability. Is our expectation reasonable? What's the best way to
> > handle OSDs with hardware failures?
> >
> > Thank you in advance for any comments or suggestions.
> >
> > Best Regards,
> > Mary Zhang

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
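
For reference, a minimal command-level sketch of the forced removal Eugen
describes above (stop the daemon, mark the OSD out, purge it). This assumes
a cephadm-managed cluster; osd.12 is a placeholder for the ID of the failed
OSD, and the exact flags should be checked against your Ceph release:

    # stop the OSD daemon through the orchestrator (cephadm deployments)
    ceph orch daemon stop osd.12
    # mark the OSD out so no new data is mapped to it
    ceph osd out 12
    # remove it from the CRUSH map, the OSD map and the auth database
    ceph osd purge 12 --yes-i-really-mean-it

Because the OSD is simply marked out and purged rather than drained, the
recovery traffic comes from the surviving replicas instead of the failing
disk, which is the point of Eugen's suggestion.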