> On our 6000+ HDD OSD cluster (pacific)

That’s the bleeding edge in a number of respects. Updating to at least Reef would bring various improvements, and I have some suggestions I'd *love* to run by you wrt upgrade speed in such a cluster, if you’re using cephadm / ceph orch. Would also love to get a copy, offline, of any non-default values you’ve found valuable.

> we've been noticing it takes significantly longer for brand new OSDs to go from booting to active when the cluster has been in a state of flux for some time. It can take over an hour for a newly created OSD to be marked up in some cases!

That’s pretty extreme.

> We've just put up with it for some time, but I finally got annoyed enough with it to look into it today...
>
> Looking at the logs of a new OSD when it's starting:
>
> 2025-01-07T13:44:05.534+0000 7f0b8b830700 3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:08.990+0000 7f0b8b830700 3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:12.394+0000 7f0b8b830700 3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718 msg say newest map is 5175990, requesting more
>
> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time. With the ~30,000(!) OSD maps it pulls down, that works out to approximately an hour. At ~4 MB a map, this also matches the ~115 GB of storage consumed by the resulting OSD with no PGs.
>
> I realise the obvious answer here is don't leave a big cluster in an unclean state for this long.

Patient: It hurts when I do this.
Doctor: Don’t do that.

> Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly.

Are you adding OSDs while the cluster is converging / backfilling / recovering?

> This is something we're always looking at from a procedure point of view, e.g. keeping max_backfills as high as possible by default, ensuring the balancer's max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand whether we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.

I might speculate that putting off OSD addition until the cluster has converged would help, perhaps followed by a rolling mon compaction before adding the OSDs.

> The first thing I noted was that the OSD block devices aren't busy during the OSDmap fetching process - they're barely doing 50 MB/s and 50 wr/s. I started looking into raising 'osd_map_share_max_epochs' to hopefully increase the number of maps shared with the new OSD per request and improve the rate, but I balked a bit after realising I would have to do this across the whole cluster (I think, anyway - not actually sure where the maps are being pulled from at this point).
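
A booting OSD can, I believe, be fed maps by the mons and by the peer OSDs it talks to, so yes, to be effective the change would want to be cluster-wide. With the centralized config database that is at least a one-liner rather than a per-host ceph.conf push. A minimal sketch of what I'd try -- the value 200 is pulled out of the air to make the example concrete, not something I've benchmarked at your scale:

    # Confirm the current value; IIRC the default is 40,
    # which would line up with the 40-epoch batches in your log
    ceph config get osd osd_map_share_max_epochs

    # Raise it for all OSDs, and for the mons too, since either may be the map source
    ceph config set osd osd_map_share_max_epochs 200
    ceph config set mon osd_map_share_max_epochs 200

I'd watch mon load and OSDMap message sizes carefully before pushing it further than that.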
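
It would also be worth confirming how many untrimmed epochs the mons are actually holding, before and after the cluster goes clean; if memory serves, `ceph report` exposes the committed range -- the same range your log shows as "src has [...]". Field names here are from memory, so check against your own report output:

    # First/last committed osdmap epochs the mons are keeping
    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

If that span stays at ~30,000 even after everything is active+clean, that points at map trimming rather than at the new OSDs themselves.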
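
And to make my two asides above concrete -- the rolling mon compaction, and the ask for your non-default values -- something along these lines. The mon IDs and the 60-second pause are placeholders, not a recommendation:

    # Compact the mon stores one at a time, letting quorum settle in between
    for m in a b c d e; do
        ceph tell mon.${m} compact
        sleep 60
    done

    # Everything set in the central config database (i.e. your overrides)
    ceph config dump

    # Per-daemon view of anything that differs from the defaults,
    # including values still coming from ceph.conf (admin socket, run on the daemon's host)
    ceph daemon osd.0 config diff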