Re: Procedure for temporary evacuation and replacement

Hi Joshua,

thanks for the reply. It is CephFS with comparatively large spinners and a significant percentage of small files. Thanks for pointing out the config option. There were still a few PGs left on the disks, and I had time to try a few settings. I'm not sure the results are really representative, though.

osd_delete_sleep = 10 : no real change
osd_delete_sleep = 60 : maybe better?
osd_delete_sleep = 300 : doesn't prevent OSDs from being marked down every now and then, but seems to reduce both frequency and impact.

Does this setting affect PG removal only, or does it affect other operations as well? Essentially: can I leave it at its current value, or should I reset it to the default?
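
In case it matters, a sketch of how I would set and later reset this per OSD (osd.123 is just a placeholder ID):

  # apply to a single drained OSD
  ceph config set osd.123 osd_delete_sleep 60

  # revert to the built-in default once the drain is done
  ceph config rm osd.123 osd_delete_sleep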

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Joshua Baergen <jbaergen@xxxxxxxxxxxxxxxx>
Sent: Thursday, October 17, 2024 3:56 PM
To: Wesley Dillingham
Cc: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Procedure for temporary evacuation and replacement

Is this a high-object-count application (S3 or small files in cephfs)?
My guess is that they're going down at the end of PG deletions, where
a rocksdb scan needs to happen. This scan can be really slow and can
exceed heartbeat timeouts, among other things. Improvements have been
made across major releases, so I'd be curious to know which release
you're using (though we've seen this in releases at least as recent as
Pacific).

Given that you're completely draining these OSDs, a workaround that
we've used in the past is to set "osd_delete_sleep" to something
ridiculously high (say, 3600) for those OSDs, effectively disabling PG
removal and avoiding this issue.
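
Roughly, as a sketch (OSD IDs are placeholders):

  # apply only to the OSDs being drained
  ceph config set osd.12 osd_delete_sleep 3600
  ceph config set osd.13 osd_delete_sleep 3600

  # remove the override again once the drain is done
  ceph config rm osd.12 osd_delete_sleep
  ceph config rm osd.13 osd_delete_sleep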

The other possibility, assuming that this is an EC system, is that
you're seeing backfill source overload; there's no reservation in Ceph
today for backfill sources on EC, and so there's no limit to the
number of active backfills a given OSD could be participating in as a
read source. This was one of the reasons we built pgremapper for
our own OSD drain use cases.

Or maybe it's a combination of the two!

HTH,
Josh

On Thu, Oct 17, 2024 at 7:45 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:
>
> Interesting, and yes, that does sound like a bug of sorts. I would consider
> increasing your osd_heartbeat_grace (at global scope), maybe by 2x (to 40 if
> currently at the default), to see you through the drain. What version are you
> using?
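>
> Something along these lines (sketch; assumes the 20s default):
>
>   ceph config set global osd_heartbeat_grace 40
>   # and back to the default after the drain:
>   ceph config rm global osd_heartbeat_grace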
>
> Respectfully,
>
> *Wes Dillingham*
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> wes@xxxxxxxxxxxxxxxxx
>
>
>
>
> On Thu, Oct 17, 2024 at 9:20 AM Frank Schilder <frans@xxxxxx> wrote:
>
> > Hi all,
> >
> > I would like to share some preliminary experience. Just setting OSDs "out"
> > manually (ceph osd out ID) does work as intended: the OSDs are drained and
> > their data is placed on other OSDs on the same host. This also survives
> > OSD reboots and peering, which turns out to be important.
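> >
> > (For reference, roughly what this looks like; OSD IDs are placeholders:)
> >
> >   ceph osd out 10 11 12        # mark the OSDs out, data drains to peers
> >   ceph osd df tree             # watch PGs and usage move off them
> >   ceph osd safe-to-destroy 10  # check when an OSD holds no more data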
> >
> > I am making the very strange observation that the drained OSDs are
> > getting marked down quite often. This actually gets worse over time: the
> > fewer PGs are left, the more frequent these "OSD marked down - OSD still
> > running wrongly marked down by mon - OSD boot" events become, and I'm a bit
> > at a loss as to what the cause might be. This is exclusively limited to OSDs
> > that are marked up+out; none of the up+in OSDs shows that behavior. There
> > seems to be no correlation with anything else; it's all of the OSDs going
> > down->up (one at a time).
> >
> > Some of these restarts might have to do with disk errors, but I doubt they
> > all do. There seems to be something else at play here. I don't think this
> > is expected behavior, and maybe someone has additional information.
> >
> > We are almost done with the evacuation. I will report back how the
> > replacement+rebalancing is going.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Frank Schilder <frans@xxxxxx>
> > Sent: Friday, October 11, 2024 12:18 PM
> > To: Robert Sander; ceph-users@xxxxxxx
> > Subject: Re: Procedure for temporary evacuation and replacement
> >
> > Hi Robert,
> >
> > thanks, that solves it then.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Robert Sander <r.sander@xxxxxxxxxxxxxxxxxxx>
> > Sent: Friday, October 11, 2024 10:20 AM
> > To: ceph-users@xxxxxxx
> > Subject: Re: Procedure for temporary evacuation and replacement
> >
> > On 10/11/24 10:07, Frank Schilder wrote:
> > > The only problem is that setting an OSD OUT might not be sticky. If the OSD
> > > reboots for some reason, it might mark itself IN again.
> >
> > The Ceph cluster distinguishes between OSDs that are manually marked out
> > ("ceph osd out N") and OSDs that are automatically marked out after being
> > down for more than 10 minutes.
> >
> > Manually marked out OSDs do not mark themselves in again.
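> >
> > (Sketch, with a placeholder OSD ID; the 10-minute auto-out delay is
> > mon_osd_down_out_interval, 600 seconds by default:)
> >
> >   ceph osd out 5                                 # manual out, sticky across restarts
> >   ceph osd in 5                                  # manual in again when done
> >   ceph config get mon mon_osd_down_out_interval  # auto-out delay for down OSDs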
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Consulting GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > https://www.heinlein-support.de
> >
> > Tel: 030 / 405051-43
> > Fax: 030 / 405051-19
> >
> > Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> > Geschäftsführer: Peer Heinlein - Sitz: Berlin
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx