Sorry Frank, I typed the wrong name.

On Tue, Apr 30, 2024, 8:51 AM Mary Zhang <maryzhang0920@xxxxxxxxx> wrote:

> Sounds good. Thank you Kevin and have a nice day!
>
> Best Regards,
> Mary
>
> On Tue, Apr 30, 2024, 8:21 AM Frank Schilder <frans@xxxxxx> wrote:
>
>> I think you are panicking way too much. Chances are that you will never
>> need that command, so don't get stressed out by an old post.
>>
>> Just follow what I wrote and, in the extremely rare case that recovery
>> does not complete due to missing information, send an e-mail to this list
>> and state that you still have the disk of the down OSD. Someone will send
>> you the export/import commands within a short time.
>>
>> So stop worrying and just administer your cluster with common storage
>> admin sense.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Mary Zhang <maryzhang0920@xxxxxxxxx>
>> Sent: Tuesday, April 30, 2024 5:00 PM
>> To: Frank Schilder
>> Cc: Eugen Block; ceph-users@xxxxxxx; Wesley Dillingham
>> Subject: Re: Re: Remove an OSD with hardware issue caused rgw 503
>>
>> Thank you Frank for sharing such valuable experience! I really appreciate it.
>> We observe similar timelines: it took more than 1 week to drain our OSD.
>> Regarding exporting PGs from a failed disk and injecting them back into the
>> cluster, do you have any documentation? I found this online: Ceph.io —
>> Incomplete PGs -- OH MY! <https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>,
>> but I'm not sure whether it's the standard process.
>>
>> Thanks,
>> Mary
>>
>> On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder <frans@xxxxxx> wrote:
>>
>> Hi all,
>>
>> I second Eugen's recommendation. We have a cluster with large HDD OSDs
>> where we see the following timings:
>>
>> - drain an OSD: 2 weeks.
>> - down an OSD and let the cluster recover: 6 hours.
>>
>> The drain-OSD procedure is - in my experience - a complete waste of time. It
>> actually puts your cluster at higher risk of a second failure (it's not
>> guaranteed that the bad PG(s) is/are drained first) and also screws up all
>> sorts of internal operations, like scrubbing, for an unnecessarily long time.
>> The recovery procedure is much faster, because it uses all-to-all recovery,
>> while draining is limited to no more than osd_max_backfills PGs at a time,
>> and your broken disk sits in the cluster much longer.
>>
>> On SSDs the "down OSD" method shows a similar speed-up factor.
>>
>> As a safety measure, don't destroy the OSD right away; wait for recovery to
>> complete and only then destroy the OSD and throw away the disk. In case an
>> error occurs during recovery, you can almost always still export PGs from a
>> failed disk and inject them back into the cluster. This, however, requires
>> taking disks out as soon as they show problems and before they fail hard.
>> Leave a little bit of lifetime so you have a chance to recover data. Look at
>> the ddrescue manual for why it is important to stop I/O from a failing disk
>> as soon as possible.
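>>
>> If it ever comes to that, the export/import is typically done with
>> ceph-objectstore-tool while the OSDs involved are stopped. A rough sketch
>> only - the OSD ids, the PG id 2.1f and the file path below are placeholders,
>> not values from your cluster:
>>
>>   # on the host with the failing disk, with osd.17 stopped:
>>   # export the PG's data to a file on a healthy device
>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
>>     --pgid 2.1f --op export --file /mnt/rescue/2.1f.export
>>
>>   # on a host with a healthy osd.23 (also stopped), inject the PG back
>>   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
>>     --op import --file /mnt/rescue/2.1f.export
>>
>> In a cephadm deployment you would typically run this from inside
>> "cephadm shell --name osd.<id>" so the tool can see the OSD's data path,
>> and then start the OSDs again afterwards.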
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Eugen Block <eblock@xxxxxx>
>> Sent: Saturday, April 27, 2024 10:29 AM
>> To: Mary Zhang
>> Cc: ceph-users@xxxxxxx; Wesley Dillingham
>> Subject: Re: Remove an OSD with hardware issue caused rgw 503
>>
>> If the rest of the cluster is healthy and your resiliency is
>> configured properly, for example to sustain the loss of one or more
>> hosts at a time, you don't need to worry about a single disk. Just
>> take it out and remove it (forcefully) so it doesn't have any clients
>> anymore. Ceph will immediately assign different primary OSDs and your
>> clients will be happy again. ;-)
>>
>> Zitat von Mary Zhang <maryzhang0920@xxxxxxxxx>:
>>
>> > Thank you Wesley for the clear explanation of the difference between the
>> > 2 methods!
>> > The tracker issue you mentioned, https://tracker.ceph.com/issues/44400,
>> > talks about primary-affinity. Could primary-affinity help remove an OSD
>> > with a hardware issue from the cluster gracefully?
>> >
>> > Thanks,
>> > Mary
>> >
>> > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
>> > wrote:
>> >
>> >> What you want to do is to stop the OSD (and all the copies of data it
>> >> contains) by stopping the OSD service immediately. The downside of this
>> >> approach is that it causes the PGs on that OSD to be degraded, but the
>> >> upside is that the OSD with the bad hardware immediately stops
>> >> participating in any client IO (the source of your RGW 503s). In this
>> >> situation the PGs go into degraded+backfilling.
>> >>
>> >> The alternative method is to keep the failing OSD up and in the cluster
>> >> but slowly migrate the data off of it. This would be a long, drawn-out
>> >> period of time in which the failing disk would continue to serve client
>> >> reads and also facilitate backfill, but you wouldn't take a copy of the
>> >> data out of the cluster and cause degraded PGs. In this scenario the PGs
>> >> would be remapped+backfilling.
>> >>
>> >> I tried to find a way to have your cake and eat it too in relation to
>> >> this "predicament" in this tracker issue:
>> >> https://tracker.ceph.com/issues/44400 but it was deemed "won't fix".
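>> >>
>> >> On a cephadm-managed cluster, "stopping the OSD service immediately"
>> >> looks roughly like this (osd.17 is only a placeholder id):
>> >>
>> >>   ceph orch daemon stop osd.17   # failing OSD stops serving client IO at once
>> >>   ceph -s                        # its PGs show as degraded; backfill starts
>> >>                                  # once the OSD is marked out
>> >>   ceph pg stat                   # quick view of PG states during recovery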
>> >>
>> >> Respectfully,
>> >>
>> >> *Wes Dillingham*
>> >> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>> >> wes@xxxxxxxxxxxxxxxxx
>> >>
>> >> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang <maryzhang0920@xxxxxxxxx>
>> >> wrote:
>> >>
>> >>> Thank you Eugen for your warm help!
>> >>>
>> >>> I'm trying to understand the difference between the 2 methods.
>> >>> For method 1, or "ceph orch osd rm osd_id", OSD Service — Ceph
>> >>> Documentation
>> >>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
>> >>> says it involves 2 steps:
>> >>>
>> >>> 1. evacuating all placement groups (PGs) from the OSD
>> >>> 2. removing the PG-free OSD from the cluster
>> >>>
>> >>> For method 2, or the procedure you recommended, Adding/Removing OSDs —
>> >>> Ceph Documentation
>> >>> <https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
>> >>> says "After the OSD has been taken out of the cluster, Ceph begins
>> >>> rebalancing the cluster by migrating placement groups out of the OSD
>> >>> that was removed."
>> >>>
>> >>> What's the difference between "evacuating PGs" in method 1 and
>> >>> "migrating PGs" in method 2? I think method 1 must read the OSD to be
>> >>> removed; otherwise, we would not see the slow ops warning. Does method 2
>> >>> not involve reading this OSD?
>> >>>
>> >>> Thanks,
>> >>> Mary
>> >>>
>> >>> On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock@xxxxxx> wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > if you remove the OSD this way, it will be drained, which means that
>> >>> > it will try to recover PGs from this OSD, and in case of a hardware
>> >>> > failure that might lead to slow requests. It might make sense to
>> >>> > forcefully remove the OSD without draining:
>> >>> >
>> >>> > - stop the osd daemon
>> >>> > - mark it as out
>> >>> > - osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
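>> >>> >
>> >>> > In concrete commands that could look something like this (osd id 17
>> >>> > is only an example):
>> >>> >
>> >>> >   ceph orch daemon stop osd.17                # stop the daemon (cephadm)
>> >>> >   ceph osd out 17                             # mark it out
>> >>> >   ceph osd purge 17 --yes-i-really-mean-it    # drop it from the CRUSH map,
>> >>> >                                               # OSD map and auth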
>> >>> >
>> >>> > Regards,
>> >>> > Eugen
>> >>> >
>> >>> > Zitat von Mary Zhang <maryzhang0920@xxxxxxxxx>:
>> >>> >
>> >>> > > Hi,
>> >>> > >
>> >>> > > We recently removed an osd from our Ceph cluster. Its underlying
>> >>> > > disk has a hardware issue.
>> >>> > >
>> >>> > > We used the command: ceph orch osd rm osd_id --zap
>> >>> > >
>> >>> > > During the process, the ceph cluster sometimes enters a warning
>> >>> > > state with slow ops on this osd. Our rgw also failed to respond to
>> >>> > > requests and returned 503.
>> >>> > >
>> >>> > > We restarted the rgw daemon to make it work again, but the same
>> >>> > > failure occurred from time to time. Eventually we noticed that the
>> >>> > > rgw 503 errors are a result of the osd slow ops.
>> >>> > >
>> >>> > > Our cluster has 18 hosts and 210 OSDs. We expect that removing an
>> >>> > > osd with a hardware issue won't impact cluster performance & rgw
>> >>> > > availability. Is our expectation reasonable? What's the best way to
>> >>> > > handle osds with hardware failures?
>> >>> > >
>> >>> > > Thank you in advance for any comments or suggestions.
>> >>> > >
>> >>> > > Best Regards,
>> >>> > > Mary Zhang

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx