Re: Slow initial boot of OSDs in large cluster with unclean state

It went from the normal osdmap range of 500-1000 maps to 30,000 maps in 5
days? That seems like excessive accumulation for a 5 day period.
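
If it's useful, something like this should show the range of maps the mons
are currently holding (assuming jq is available; the field names are from
memory, so double-check them):

    ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'

If I remember right, the mons only trim osdmaps up to the oldest
last_epoch_clean across all PGs (and always keep at least
mon_min_osdmap_epochs, 500 by default), which would explain both your normal
500-1000 range and the ~30,000 maps after 5 days of remapped PGs.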

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx




On Tue, Jan 7, 2025 at 1:18 PM Thomas Byrne - STFC UKRI <
tom.byrne@xxxxxxxxxx> wrote:

> Hi all,
>
> On our 6000+ HDD OSD cluster (pacific), we've noticed that it takes
> significantly longer for brand-new OSDs to go from booting to active when
> the cluster has been in a state of flux for some time. It can take over an
> hour for a newly created OSD to be marked up in some cases! We've just put
> up with it for a while, but I finally got annoyed enough to look into it
> today...
>
> Looking at the logs of a new OSD when it's starting:
>
> 2025-01-07T13:44:05.534+0000 7f0b8b830700  3 osd.2016 5165598
> handle_osd_map epochs [5165599,5165638], i have 5165598, src has
> [5146718,5175990]
> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638  msg say
> newest map is 5175990, requesting more
> 2025-01-07T13:44:08.990+0000 7f0b8b830700  3 osd.2016 5165638
> handle_osd_map epochs [5165639,5165678], i have 5165638, src has
> [5146718,5175990]
> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678  msg say
> newest map is 5175990, requesting more
> 2025-01-07T13:44:12.394+0000 7f0b8b830700  3 osd.2016 5165678
> handle_osd_map epochs [5165679,5165718], i have 5165678, src has
> [5146718,5175990]
> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718  msg say
> newest map is 5175990, requesting more
>
> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each
> time. With the ~30,000(!) OSD maps it pulls down, it takes approximately an
> hour. At ~4MB a map, this then matches up with the ~115GB storage consumed
> by the resulting OSD with no PGs.
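>
> For anyone wanting to sanity-check the per-map size, grabbing a single full
> map should work something like this (the epoch is just one from the log
> above, and osdmaptool can decode it if you want to see what's in it):
>
>     ceph osd getmap 5165638 -o /tmp/osdmap.5165638
>     ls -lh /tmp/osdmap.5165638
>     # decode the map to see what's taking up the space
>     osdmaptool /tmp/osdmap.5165638 --print | head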
>
> I realise the obvious answer here is don't leave a big cluster in an
> unclean state for this long. Currently we've got PGs that have been
> remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly.
> This is something we're always looking at from a procedure point of view,
> e.g. keeping max_backfills as high as possible by default, ensuring the
> balancer's max_misplaced is appropriate, and re-evaluating our disk and
> node addition/removal processes. But the reality on this cluster is that
> these 'logjams' sometimes happen, and it would be good to understand
> whether we can improve the OSD addition experience so we can stay flexible
> with our operation scheduling.
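>
> For reference, the sort of knobs I mean are along these lines (values are
> only illustrative, not recommendations, and I think the balancer setting is
> called target_max_misplaced_ratio on the mgr these days):
>
>     # allow more concurrent backfills per OSD (pacific default is 1)
>     ceph config set osd osd_max_backfills 3
>     # allow a larger misplaced fraction for the balancer (default 0.05)
>     ceph config set mgr target_max_misplaced_ratio 0.07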
>
> The first thing I noted was that the OSD block devices aren't busy during
> the OSDmap fetching process - they're barely doing 50 MB/s and 50 wr/s. I
> started looking into raising 'osd_map_share_max_epochs' to increase the
> number of maps shared with the new OSD per request and hopefully improve
> the rate, but I balked a bit after realising I would have to do this across
> the whole cluster (I think, anyway - I'm not actually sure where the maps
> are being pulled from at this point). All the tuning advice I could find
> for this value talked about reducing it, which scared me further.
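>
> For the record, this is roughly what I had in mind (untested - I believe
> both options default to 40, which would explain the 40-epoch batches in
> the log above, but please correct me if the limit is somewhere else):
>
>     ceph config get osd osd_map_share_max_epochs
>     ceph config get osd osd_map_message_max
>     # raise both cluster-wide; either one could cap the batch size
>     ceph config set osd osd_map_share_max_epochs 100
>     ceph config set osd osd_map_message_max 100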
>
> Additionally, there's clearly some interplay between 'osd_map_cache_size'
> and 'osd_map_message_max' to consider. These historic maps must generally
> be pulled from disk anyway (be it on the OSD or the mon), so it shouldn't
> make a difference if osd_map_share_max_epochs > osd_map_cache_size, but I
> suppose you don't normally want OSDs having to grab maps off disk to
> service requests from peers?
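>
> If it helps anyone reproduce this, the cache setting and (if I'm reading
> the perf counter names right) the map cache hit/miss counters can be
> checked with something like:
>
>     ceph config get osd osd_map_cache_size
>     # admin-socket perf counters for the map cache
>     # (osd.2016 is just the example id from the log above)
>     ceph daemon osd.2016 perf dump | \
>         jq '.osd | {osd_map_cache_hit, osd_map_cache_miss}'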
>
> (There may also be a completely different dominating factor of the time to
> download and store the maps that I'm not considering here.)
>
> So, any advice on improving the speed of the OSDmap download for fresh
> OSDs would be appreciated, or any other thoughts about this situation.
>
> Thanks,
> Tom
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



