I recently had a similar issue on one of my clusters that might be related. I found that when new OSDs were added to the cluster, they were taking a long time to start. This turned out to be caused by the new OSDs needing to pull down a couple hundred thousand osdmaps from the mon nodes.

To see if you're also affected by this, try running the following command:

ceph report 2>/dev/null | jq '(.osdmap_last_committed - .osdmap_first_committed)'

This number should be between 500 and 1000 on a healthy cluster. I've seen it as high as 4.8 million before (roughly 50% of the data stored on the cluster ended up being osdmaps!). If you're curious how large a single osdmap is, you can run this command to save the current osdmap to a file:

ceph osd getmap -o [filename]

This appears to be a bug that should be fixed in the latest releases of Ceph (Quincy 17.2.8 & Reef 18.2.4), based on this report: https://tracker.ceph.com/issues/63883

In the meantime, if you are seeing a large difference between the first and last committed osdmaps, you can usually clear that up by restarting each of the mon daemons sequentially, starting with the primary and waiting for it to rejoin the quorum before moving on to the next one. Doing this on the cluster I mentioned above reduced the startup time of the new OSDs from 10-20 minutes each to just a few seconds!

Bryan

From: Gregory Orange <gregory.orange@xxxxxxxxxxxxx>
Date: Thursday, January 23, 2025 at 01:52
To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: Slow initial boot of OSDs in large cluster with unclean state

Sometimes starting an OSD can take up to 20 minutes, so there may be some shared experience there. However, apart from a harrowing period last year[1] we live in HEALTH_OK most of the time.
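
In case it's useful, here is a rough shell sketch of the osdmap check I described above, suitable for dropping into a cron job or monitoring script. It assumes the ceph CLI and jq are installed and that the local keyring is allowed to run "ceph report"; the 1000-map threshold is just my own rule of thumb based on the healthy range mentioned earlier, not an official limit.

#!/bin/sh
# Sketch: warn when the mons are retaining an unusually large range of osdmaps.
# Assumes the ceph CLI and jq are available and "ceph report" works with the
# local keyring. The threshold is a rule of thumb, not an official limit.

gap=$(ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed')

if [ -z "$gap" ]; then
    echo "ERROR: could not query 'ceph report' (check keyring/permissions)."
    exit 1
fi

threshold=1000   # a healthy cluster usually keeps roughly 500-1000 maps

if [ "$gap" -gt "$threshold" ]; then
    echo "WARNING: mons are holding $gap osdmaps (expected <= $threshold)."
    echo "Newly added OSDs may be slow to start while they fetch these maps."
else
    echo "OK: osdmap range is $gap."
fi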