Interesting, and yes, that does sound like a bug of sorts. I would consider
increasing your osd_heartbeat_grace (at global scope), maybe by 2x (to 40 if
it is currently at the default), to see you through the drain.

What version are you using?

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx

On Thu, Oct 17, 2024 at 9:20 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> I would like to share some preliminary experience. Just setting OSDs "out"
> manually (ceph osd out ID) does work as intended. The OSDs are drained and
> their data is placed on other OSDs on the same host. This also survives
> reboots of OSDs and peering, and this turns out to be important.
>
> I have made the very strange observation that the OSDs being drained are
> marked down quite often. This actually gets worse over time: the fewer PGs
> are left, the more frequent these "OSD marked down - OSD still running,
> wrongly marked down by mon - OSD boot" events become, and I'm a bit at a
> loss as to what the cause might be. This is limited exclusively to OSDs
> that are marked up+out; none of the up+in OSDs shows this behavior. There
> seems to be no correlation with anything else; it's all of the OSDs going
> down->up (one at a time).
>
> Some of these restarts might have to do with disk errors, but I doubt all
> do. There seems to be something else at play here. I don't think this is
> expected, and maybe someone has additional information.
>
> We are almost done with the evacuation. I will report back on how the
> replacement+rebalancing is going.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Friday, October 11, 2024 12:18 PM
> To: Robert Sander; ceph-users@xxxxxxx
> Subject: Re: Procedure for temporary evacuation and replacement
>
> Hi Robert,
>
> thanks, that solves it then.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Robert Sander <r.sander@xxxxxxxxxxxxxxxxxxx>
> Sent: Friday, October 11, 2024 10:20 AM
> To: ceph-users@xxxxxxx
> Subject: Re: Procedure for temporary evacuation and replacement
>
> On 10/11/24 10:07, Frank Schilder wrote:
> > Only problem is that setting an OSD OUT might not be sticky. If the OSD
> > reboots for some reason it might mark itself IN again.
>
> The Ceph cluster distinguishes between manually marked out ("ceph osd
> out N") and automatically marked out, when an OSD is down for more than
> 10 minutes.
>
> Manually marked out OSDs do not mark themselves in again.
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
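
For reference, the suggestion at the top of this thread (doubling
osd_heartbeat_grace at global scope for the duration of the drain) could be
sketched roughly as below. This is only a sketch under stated assumptions,
not a tested procedure: it assumes the cluster uses the centralized config
database ("ceph config ..."), that the value currently in place is the
default of 20 seconds, and that the override should be removed once the
drain is finished. The run() dry-run wrapper and the variable names are
illustrative helpers that do not appear anywhere in the thread.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command instead of executing it,
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

GRACE_DEFAULT=20                   # assumed current value (the default)
GRACE_NEW=$((GRACE_DEFAULT * 2))   # 2x, as suggested in the thread

run ceph versions                              # which release is running?
run ceph config get osd osd_heartbeat_grace    # value configured for OSDs
run ceph config set global osd_heartbeat_grace "$GRACE_NEW"

# Assumption: once the drain has finished, drop the override again.
run ceph config rm global osd_heartbeat_grace
```

By default the script only echoes the commands so they can be reviewed
first; running it with DRY_RUN=0 would execute them against the cluster.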