Thank you, Frank, for sharing such valuable experience! I really appreciate
it. We observe similar timelines: it took more than 1 week to drain our OSD.
Regarding exporting PGs from a failed disk and injecting them back into the
cluster, do you have any documentation? I found this online: Ceph.io -
Incomplete PGs -- OH MY!
<https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>, but I am not sure
whether it's the standard process.

Thanks,
Mary

On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> I second Eugen's recommendation. We have a cluster with large HDD OSDs
> where the following timings are found:
>
> - drain an OSD: 2 weeks.
> - down an OSD and let cluster recover: 6 hours.
>
> The drain OSD procedure is - in my experience - a complete waste of time,
> actually puts your cluster at higher risk of a second failure (it's not
> guaranteed that the bad PG(s) is/are drained first) and also screws up all
> sorts of internal operations like scrub etc. for an unnecessarily long time.
> The recovery procedure is much faster, because it uses all-to-all recovery
> while drain is limited to no more than max_backfills PGs at a time and your
> broken disk sits much longer in the cluster.
>
> On SSDs the "down OSD"-method shows a similar speed-up factor.
>
> As a precaution, don't destroy the OSD right away; wait for recovery to
> complete and only then destroy the OSD and throw away the disk.
> In case an error occurs during recovery, you can almost always still export
> PGs from a failed disk and inject them back into the cluster. This, however,
> requires taking disks out as soon as they show problems and before they
> fail hard. Keep a little bit of lifetime to have a chance to recover data.
> Look at the manual of ddrescue to see why it is important to stop IO from a
> failing disk as soon as possible.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Eugen Block <eblock@xxxxxx>
> Sent: Saturday, April 27, 2024 10:29 AM
> To: Mary Zhang
> Cc: ceph-users@xxxxxxx; Wesley Dillingham
> Subject: Re: Remove an OSD with hardware issue caused rgw 503
>
> If the rest of the cluster is healthy and your resiliency is
> configured properly, for example to sustain the loss of one or more
> hosts at a time, you don’t need to worry about a single disk. Just
> take it out and remove it (forcefully) so it doesn’t have any clients
> anymore. Ceph will immediately assign different primary OSDs and your
> clients will be happy again. ;-)
>
> Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
>
> > Thank you Wesley for the clear explanation of the 2 methods!
> > The tracker issue you mentioned, https://tracker.ceph.com/issues/44400,
> > talks about primary-affinity. Could primary-affinity help remove an OSD
> > with a hardware issue from the cluster gracefully?
> >
> > Thanks,
> > Mary
> >
> >
> > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
> > wrote:
> >
> >> What you want to do is to stop the OSD (and all the copies of data it
> >> contains) by stopping the OSD service immediately. The downside of this
> >> approach is it causes the PGs on that OSD to be degraded. But the upside
> >> is the OSD which has bad hardware is immediately no longer participating
> >> in any client IO (the source of your RGW 503s). In this situation the PGs
> >> go into degraded+backfilling.
> >>
> >> The alternative method is to keep the failing OSD up and in the cluster
> >> but slowly migrate the data off of it. This would be a long drawn-out
> >> period of time in which the failing disk would continue to serve client
> >> reads and also facilitate backfill, but you wouldn't take a copy of the
> >> data out of the cluster and cause degraded PGs.
> >> In this scenario the PGs would be remapped+backfilling.
> >>
> >> I tried to find a way to have your cake and eat it too in relation to
> >> this "predicament" in this tracker issue:
> >> https://tracker.ceph.com/issues/44400
> >> but it was deemed "won't fix".
> >>
> >> Respectfully,
> >>
> >> *Wes Dillingham*
> >> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> >> wes@xxxxxxxxxxxxxxxxx
> >>
> >>
> >> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang <maryzhang0920@xxxxxxxxx>
> >> wrote:
> >>
> >>> Thank you Eugen for your warm help!
> >>>
> >>> I'm trying to understand the difference between the 2 methods.
> >>> For method 1, or "ceph orch osd rm osd_id", OSD Service - Ceph
> >>> Documentation
> >>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
> >>> says it involves 2 steps:
> >>>
> >>> 1. evacuating all placement groups (PGs) from the OSD
> >>> 2. removing the PG-free OSD from the cluster
> >>>
> >>> For method 2, or the procedure you recommended, Adding/Removing OSDs -
> >>> Ceph Documentation
> >>> <https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual>
> >>> says "After the OSD has been taken out of the cluster, Ceph begins
> >>> rebalancing the cluster by migrating placement groups out of the OSD
> >>> that was removed."
> >>>
> >>> What's the difference between "evacuating PGs" in method 1 and
> >>> "migrating PGs" in method 2? I think method 1 must read the OSD to be
> >>> removed. Otherwise, we would not see the slow ops warning. Does method 2
> >>> not involve reading this OSD?
> >>>
> >>> Thanks,
> >>> Mary
> >>>
> >>> On Fri, Apr 26, 2024 at 5:15 AM Eugen Block <eblock@xxxxxx> wrote:
> >>>
> >>> > Hi,
> >>> >
> >>> > if you remove the OSD this way, it will be drained. Which means that
> >>> > it will try to recover PGs from this OSD, and in case of hardware
> >>> > failure it might lead to slow requests.
> >>> > It might make sense to forcefully remove the OSD without draining:
> >>> >
> >>> > - stop the osd daemon
> >>> > - mark it as out
> >>> > - osd purge <id|osd.id> [--force] [--yes-i-really-mean-it]
> >>> >
> >>> > Regards,
> >>> > Eugen
> >>> >
> >>> > Quoting Mary Zhang <maryzhang0920@xxxxxxxxx>:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > We recently removed an osd from our Ceph cluster. Its underlying
> >>> > > disk has a hardware issue.
> >>> > >
> >>> > > We use command: ceph orch osd rm osd_id --zap
> >>> > >
> >>> > > During the process, sometimes the ceph cluster enters warning state
> >>> > > with slow ops on this osd. Our rgw also failed to respond to
> >>> > > requests and returned 503.
> >>> > >
> >>> > > We restarted the rgw daemon to make it work again. But the same
> >>> > > failure occurred from time to time. Eventually we noticed that the
> >>> > > rgw 503 error is a result of osd slow ops.
> >>> > >
> >>> > > Our cluster has 18 hosts and 210 OSDs. We expect that removing an
> >>> > > osd with a hardware issue won't impact cluster performance & rgw
> >>> > > availability. Is our expectation reasonable? What's the best way to
> >>> > > handle an osd with hardware failures?
> >>> > >
> >>> > > Thank you in advance for any comments or suggestions.
> >>> > >
> >>> > > Best Regards,
> >>> > > Mary Zhang
> >>> > > _______________________________________________
> >>> > > ceph-users mailing list -- ceph-users@xxxxxxx
> >>> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
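
For reference, the forced-removal sequence discussed in this thread (stop the
OSD daemon, mark it out, wait for recovery, then purge) could be sketched
roughly as below. This is only an illustration under stated assumptions, not a
procedure confirmed by anyone in the thread: the helper name
forced_removal_plan is hypothetical, the "ceph orch daemon stop" form assumes a
cephadm-managed cluster (the systemctl form assumes a package-based install),
and the "ceph osd safe-to-destroy" check is one possible way to follow the
advice to wait for recovery before destroying the OSD. The sketch only builds
the command lines; it does not run anything.

```python
import shlex
from typing import List


def forced_removal_plan(osd_id: int, use_cephadm: bool = True) -> List[List[str]]:
    """Build (but do not run) the commands to remove a failing OSD without draining.

    Hypothetical helper for illustration only. The operator is expected to run
    the steps one at a time and to wait for recovery to finish before the purge.
    """
    if osd_id < 0:
        raise ValueError("osd_id must be non-negative")

    name = f"osd.{osd_id}"

    # Step 1: stop the daemon so the failing disk no longer serves client IO.
    # Assumption: cephadm deployments use "ceph orch daemon stop",
    # package-based deployments use the ceph-osd@<id> systemd unit.
    if use_cephadm:
        stop = ["ceph", "orch", "daemon", "stop", name]
    else:
        stop = ["systemctl", "stop", f"ceph-osd@{osd_id}"]

    return [
        stop,
        # Step 2: mark the OSD out so its PGs are recovered from other copies.
        ["ceph", "osd", "out", str(osd_id)],
        # Step 3 (added here as an assumption): check that the OSD can be
        # destroyed safely, i.e. recovery has completed.
        ["ceph", "osd", "safe-to-destroy", name],
        # Step 4: only then purge the OSD from the cluster.
        ["ceph", "osd", "purge", str(osd_id), "--yes-i-really-mean-it"],
    ]


if __name__ == "__main__":
    # Print the plan for a made-up OSD id so it can be reviewed before use.
    for argv in forced_removal_plan(12):
        print(shlex.join(argv))
```

Keeping the disk around until the safe-to-destroy check passes leaves the
option of exporting PGs from it if recovery runs into an error.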