Thank you Eugen for your warm help! I'm trying to understand the
difference between the two methods.

For method 1, "ceph orch osd rm osd_id", OSD Service — Ceph Documentation
<https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
says it involves two steps:
1. evacuating all placement groups (PGs) from the OSD
2. removing the PG-free OSD from the cluster

For method 2, the procedure you recommended, Adding/Removing OSDs — Ceph
Documentation
<https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
says: "After the OSD has been taken out of the cluster, Ceph begins
rebalancing the cluster by migrating placement groups out of the OSD
that was removed."

What's the difference between "evacuating PGs" in method 1 and
"migrating PGs" in method 2? I think method 1 must read the OSD being
removed; otherwise we would not see the slow ops warning. Does method 2
not involve reading this OSD?

Thanks,
Mary

On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> if you remove the OSD this way, it will be drained, which means that
> Ceph will try to recover the PGs from this OSD, and in case of a
> hardware failure this can lead to slow requests. It might make sense
> to forcefully remove the OSD without draining:
>
> - stop the osd daemon
> - mark it as out
> - osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
>
> Regards,
> Eugen
>
> Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
>
> > Hi,
> >
> > We recently removed an OSD from our Ceph cluster. Its underlying
> > disk has a hardware issue.
> >
> > We used the command: ceph orch osd rm osd_id --zap
> >
> > During the process, the cluster sometimes enters a warning state
> > with slow ops on this OSD. Our RGW also failed to respond to
> > requests and returned 503.
> >
> > We restarted the RGW daemon to make it work again, but the same
> > failure occurred from time to time. Eventually we noticed that the
> > RGW 503 errors were a result of the OSD slow ops.
> >
> > Our cluster has 18 hosts and 210 OSDs. We expect that removing an
> > OSD with a hardware issue won't impact cluster performance and RGW
> > availability. Is our expectation reasonable? What's the best way to
> > handle OSDs with hardware failures?
> >
> > Thank you in advance for any comments or suggestions.
> >
> > Best Regards,
> > Mary Zhang

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
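
For reference, a minimal command-level sketch of the forced removal Eugen
describes above (stop the daemon, mark the OSD out, purge it). This assumes
a cephadm-managed cluster; osd.12 is a placeholder for the ID of the failed
OSD, and the exact flags should be checked against your Ceph release:

    # stop the OSD daemon through the orchestrator (cephadm deployments)
    ceph orch daemon stop osd.12
    # mark the OSD out so no new data is mapped to it
    ceph osd out 12
    # remove it from the CRUSH map, the OSD map and the auth database
    ceph osd purge 12 --yes-i-really-mean-it

Because the OSD is simply marked out and purged rather than drained, the
recovery traffic comes from the surviving replicas instead of the failing
disk, which is the point of Eugen's suggestion.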