Sounds good. Thank you Kevin and have a nice day!

Best Regards,
Mary

On Tue, Apr 30, 2024, 8:21 AM Frank Schilder <frans@xxxxxx> wrote:

> I think you are panicking way too much. Chances are that you will never
> need that command, so don't get worked up over an old post.
>
> Just follow what I wrote and, in the extremely rare case that recovery
> does not complete due to missing information, send an e-mail to this list
> and state that you still have the disk of the down OSD. Someone will send
> you the export/import commands within a short time.
>
> So stop worrying and just administer your cluster with common storage
> admin sense.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Mary Zhang <maryzhang0920@xxxxxxxxx>
> Sent: Tuesday, April 30, 2024 5:00 PM
> To: Frank Schilder
> Cc: Eugen Block; ceph-users@xxxxxxx; Wesley Dillingham
> Subject: Re: Re: Remove an OSD with hardware issue caused rgw 503
>
> Thank you Frank for sharing such valuable experience! I really appreciate
> it.
> We observe similar timelines: it took more than 1 week to drain our OSD.
> Regarding exporting PGs from a failed disk and injecting them back into
> the cluster, do you have any documentation? I found this online:
> Ceph.io — Incomplete PGs -- OH MY!
> <https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>, but I'm not
> sure whether it's the standard process.
>
> Thanks,
> Mary
>
> On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I second Eugen's recommendation. We have a cluster with large HDD OSDs
> where the following timings are found:
>
> - drain an OSD: 2 weeks.
> - down an OSD and let the cluster recover: 6 hours.
>
> The drain-OSD procedure is - in my experience - a complete waste of time.
> It actually puts your cluster at higher risk of a second failure (it's
> not guaranteed that the bad PG(s) is/are drained first) and also screws
> up all sorts of internal operations like scrub etc. for an unnecessarily
> long time. The recovery procedure is much faster, because it uses
> all-to-all recovery, while drain is limited to no more than max_backfills
> PGs at a time and your broken disk sits much longer in the cluster.
>
> On SSDs the "down OSD" method shows a similar speed-up factor.
>
> As a safety measure, don't destroy the OSD right away; wait for recovery
> to complete and only then destroy the OSD and throw away the disk. In
> case an error occurs during recovery, you can almost always still export
> PGs from the failed disk and inject them back into the cluster. This,
> however, requires taking disks out as soon as they show problems and
> before they fail hard. Keep a little bit of lifetime left to have a
> chance to recover data. See the ddrescue manual for why it is important
> to stop I/O on a failing disk as soon as possible.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
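
For reference, the PG export/import that Frank mentions is usually done
with ceph-objectstore-tool while the affected OSDs are stopped. A minimal
sketch only - the OSD IDs, PG ID and file path below are placeholders, so
check the documentation for your release before relying on it:

    # On the host with the failed (stopped) OSD: export the missing PG to a file.
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<failed_id> \
        --pgid <pg_id> --op export --file /tmp/<pg_id>.export

    # On a healthy OSD (temporarily stopped) that should receive the PG: import it.
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target_id> \
        --op import --file /tmp/<pg_id>.export

On a cephadm-managed cluster this would typically be run from inside
"cephadm shell --name osd.<id>". After the import, start the target OSD
again and let normal recovery/backfill finish the job.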

> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Saturday, April 27, 2024 10:29 AM
> To: Mary Zhang
> Cc: ceph-users@xxxxxxx; Wesley Dillingham
> Subject: Re: Remove an OSD with hardware issue caused rgw 503
>
> If the rest of the cluster is healthy and your resiliency is configured
> properly, for example to sustain the loss of one or more hosts at a time,
> you don't need to worry about a single disk. Just take it out and remove
> it (forcefully) so it doesn't have any clients anymore. Ceph will
> immediately assign different primary OSDs and your clients will be happy
> again. ;-)
>
> Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
>
> > Thank you Wesley for the clear explanation of the difference between
> > the 2 methods!
> > The tracker issue you mentioned, https://tracker.ceph.com/issues/44400,
> > talks about primary-affinity. Could primary-affinity help remove an OSD
> > with a hardware issue from the cluster gracefully?
> >
> > Thanks,
> > Mary
> >
> > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
> > wrote:
> >
> >> What you want to do is to stop the OSD (and all its copies of data it
> >> contains) by stopping the OSD service immediately. The downside of this
> >> approach is that it causes the PGs on that OSD to be degraded. But the
> >> upside is that the OSD which has bad hardware is immediately no longer
> >> participating in any client IO (the source of your RGW 503s). In this
> >> situation the PGs go into degraded+backfilling.
> >>
> >> The alternative method is to keep the failing OSD up and in the cluster
> >> but slowly migrate the data off of it. This would be a long drawn-out
> >> period of time in which the failing disk would continue to serve client
> >> reads and also facilitate backfill, but you wouldn't take a copy of the
> >> data out of the cluster and cause degraded PGs. In this scenario the
> >> PGs would be remapped+backfilling.
> >>
> >> I tried to find a way to have your cake and eat it too in relation to
> >> this "predicament" in this tracker issue:
> >> https://tracker.ceph.com/issues/44400
> >> but it was deemed "won't fix".
> >>
> >> Respectfully,
> >>
> >> *Wes Dillingham*
> >> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> >> wes@xxxxxxxxxxxxxxxxx
> >>
> >> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang <maryzhang0920@xxxxxxxxx>
> >> wrote:
> >>
> >>> Thank you Eugen for your warm help!
> >>>
> >>> I'm trying to understand the difference between the 2 methods.
> >>> For method 1, or "ceph orch osd rm osd_id", OSD Service — Ceph
> >>> Documentation
> >>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
> >>> says it involves 2 steps:
> >>>
> >>> 1. evacuating all placement groups (PGs) from the OSD
> >>> 2. removing the PG-free OSD from the cluster
> >>>
> >>> For method 2, or the procedure you recommended, Adding/Removing OSDs —
> >>> Ceph Documentation
> >>> <https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
> >>> says "After the OSD has been taken out of the cluster, Ceph begins
> >>> rebalancing the cluster by migrating placement groups out of the OSD
> >>> that was removed."
> >>>
> >>> What's the difference between "evacuating PGs" in method 1 and
> >>> "migrating PGs" in method 2? I think method 1 must read the OSD to be
> >>> removed. Otherwise, we would not see the slow ops warning. Does
> >>> method 2 not involve reading this OSD?
> >>>
> >>> Thanks,
> >>> Mary
> >>>
> >>> On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock@xxxxxx> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > if you remove the OSD this way, it will be drained, which means that
> >>> > it will try to recover PGs from this OSD, and in case of hardware
> >>> > failure it might lead to slow requests. It might make sense to
> >>> > forcefully remove the OSD without draining:
> >>> >
> >>> > - stop the osd daemon
> >>> > - mark it as out
> >>> > - osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
> >>> >
> >>> > Regards,
> >>> > Eugen
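
Spelled out for a cephadm-managed cluster like the one in the original
question below, Eugen's sequence would look roughly like this; the OSD ID,
host and device names are placeholders, so verify the commands against the
documentation for your release:

    # Stop the daemon so the failing disk stops serving client IO.
    ceph orch daemon stop osd.<id>
    # (on non-cephadm clusters: systemctl stop ceph-osd@<id>)

    # Mark it out so recovery to the remaining OSDs starts immediately.
    ceph osd out <id>

    # As Frank advises, wait for recovery to complete, then remove the OSD
    # from the CRUSH map, auth entries and OSD map in one step.
    ceph osd purge <id> --yes-i-really-mean-it

    # Optionally wipe the disk before it is pulled or replaced.
    ceph orch device zap <host> /dev/<device> --force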
> >>> >
> >>> > Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > We recently removed an OSD from our Ceph cluster. Its underlying
> >>> > > disk has a hardware issue.
> >>> > >
> >>> > > We used the command: ceph orch osd rm osd_id --zap
> >>> > >
> >>> > > During the process, the ceph cluster sometimes enters a warning
> >>> > > state with slow ops on this OSD. Our rgw also failed to respond to
> >>> > > requests and returned 503.
> >>> > >
> >>> > > We restarted the rgw daemon to make it work again, but the same
> >>> > > failure occurred from time to time. Eventually we noticed that the
> >>> > > rgw 503 errors are a result of the OSD slow ops.
> >>> > >
> >>> > > Our cluster has 18 hosts and 210 OSDs. We expected that removing
> >>> > > an OSD with a hardware issue wouldn't impact cluster performance
> >>> > > & rgw availability. Is our expectation reasonable? What's the best
> >>> > > way to handle OSDs with hardware failures?
> >>> > >
> >>> > > Thank you in advance for any comments or suggestions.
> >>> > >
> >>> > > Best Regards,
> >>> > > Mary Zhang
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx