Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Wes,

It works out at about five new osdmaps a minute, which is about normal for this cluster's state changes as far as I can tell. It'll drop down to 2-3 maps/minute during quiet periods, but the combination of the upmap balancer making changes and occasional OSD flaps or crashes due to hardware issues is enough to cause a fairly reliable rate of osdmap churn.

This churn is something we are working on understanding, and reducing where possible, now that we know it's becoming a pain point for us.
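
For reference, a rough way to watch this (an untested sketch - the report field names are from memory, so worth checking against your release) is to sample the current osdmap epoch over time, and to look at the range of maps the mons have committed but not yet trimmed:

    # sample the osdmap epoch once a minute to estimate the churn rate
    while sleep 60; do date; ceph osd dump -f json | jq .epoch; done

    # range of osdmaps the mons are still holding
    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'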

Thanks,
Tom

________________________________
From: Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
Sent: Tuesday, January 7, 2025 18:41
To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re:  Slow initial boot of OSDs in large cluster with unclean state

It went from a normal osdmap range of 500-1000 maps to 30,000 maps in 5 days? That seems like excessive accumulation to me for a 5-day period.

Respectfully,

Wes Dillingham
LinkedIn: http://www.linkedin.com/in/wesleydillingham
wes@xxxxxxxxxxxxxxxxx




On Tue, Jan 7, 2025 at 1:18 PM Thomas Byrne - STFC UKRI <tom.byrne@xxxxxxxxxx> wrote:
Hi all,

On our 6000+ HDD OSD cluster (Pacific), we've noticed it takes significantly longer for brand new OSDs to go from booting to active when the cluster has been in a state of flux for a while. It can take over an hour for a newly created OSD to be marked up in some cases! We've just put up with it until now, but I finally got annoyed enough to look into it today...

Looking at the logs of a new OSD when it's starting:

2025-01-07T13:44:05.534+0000 7f0b8b830700  3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638  msg say newest map is 5175990, requesting more
2025-01-07T13:44:08.990+0000 7f0b8b830700  3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678  msg say newest map is 5175990, requesting more
2025-01-07T13:44:12.394+0000 7f0b8b830700  3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718  msg say newest map is 5175990, requesting more

It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time. With the ~30,000(!) OSD maps it pulls down, it takes approximately an hour. At ~4MB a map, this then matches up with the ~115GB storage consumed by the resulting OSD with no PGs.
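
Spelling out the back-of-envelope numbers:

    ~30,000 maps / 40 maps per fetch ≈ 750 fetches
    750 fetches x ~4 s per fetch     ≈ 3,000 s, i.e. roughly the hour observed
    ~30,000 maps x ~4 MB per map     ≈ 120 GB, in line with the ~115GB on disk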

I realise the obvious answer here is: don't leave a big cluster in an unclean state for this long. Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly. This is something we're always looking at from a procedural point of view, e.g. keeping max_backfills as high as possible by default, ensuring the balancer's max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand whether we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.
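
(For the record, 30,000 epochs over 5 days works out to roughly 4 new osdmaps a minute, sustained.) The knobs mentioned above can be checked centrally; a hedged example, with the balancer option name written from memory so worth double-checking:

    ceph config get osd osd_max_backfills
    ceph config get mgr target_max_misplaced_ratio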

The first thing I noted was that the OSD block devices aren't busy during the OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I started looking into raising 'osd_map_share_max_epochs' in the hope of increasing the number of maps shared with the new OSD per request and improving the rate, but I balked a bit after realising I would have to do this across the whole cluster (I think, anyway - I'm not actually sure where the maps are being pulled from at this point). All the tuning advice I could find for this value talked about reducing it, which scared me further.
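
If we did experiment with it, something like the following is roughly what I'd try (an untested sketch - I'm not certain whether running OSDs pick the change up without a restart, and the value of 100 is purely hypothetical):

    # effective value from the central config
    ceph config get osd osd_map_share_max_epochs

    # what one running OSD actually has (the 40-map batches in the log above
    # suggest a default of 40, though osd_map_message_max could equally be
    # the limiter)
    ceph tell osd.2016 config get osd_map_share_max_epochs

    # hypothetical raised value, applied cluster-wide via the central config
    ceph config set osd osd_map_share_max_epochs 100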

Additionally, there's clearly some interplay between 'osd_map_cache_size' and 'osd_map_message_max' to consider. These historic maps must generally be pulled from disk anyway (be it on an OSD or a mon), so it shouldn't make a difference if osd_map_share_max_epochs > osd_map_cache_size, but in general I suppose you don't want OSDs having to grab maps off disk to answer requests from peers?
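
A quick way to eyeball all three options together (a sketch - if I remember right, 'ceph config get' falls back to the compiled-in default when nothing has been set):

    for opt in osd_map_share_max_epochs osd_map_message_max osd_map_cache_size; do
        echo -n "$opt = "; ceph config get osd "$opt"
    done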

(There may also be a completely different dominating factor of the time to download and store the maps that I'm not considering here.)

So, any advice on improving the speed of the OSDmap download for fresh OSDs would be appreciated, as would any other thoughts on this situation.

Thanks,
Tom
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
