Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Anthony,

Please see my replies inline. I also just wanted to say I really enjoyed your talk about QLC flash at Cephalocon; there was a lot of useful info in there.

>> On our 6000+ HDD OSD cluster (pacific)
>
>That’s the bleeding edge in a number of respects.  Updating to at least Reef would bring various improvements, and I have some suggestions I’d *love* to run by you wrt upgrade speed in such a cluster, if you’re using cephadm / ceph orch.  Would also love to get, offline, a copy of any non-default values you’ve found valuable.

Upgrades are planned, but we're currently on Rocky 8 with RPMs, so there are a few steps before we can get to Reef, unfortunately. The current plan is to migrate to cephadm post-Quincy, so I'll be sure to get in touch once we're there. One of the big concerns about moving this cluster to cephadm has been the orchestrator performance at scale, so I'm happy to hear that you're interested!

For more general Ceph tuning, the OSDs are fairly happy with the defaults, I assume because they still see a similar number of peers irrespective of total OSD count. The main thing is that the monitors and managers (not unexpectedly) start having a harder time getting everything done within the default intervals, so increasing mon_lease and mgr_stats_period helps prevent election loops during map creation and monitoring artifacts, respectively. We've also had to rescale the CRUSH weights somewhere around 80PB. Other than that it largely just works as intended, which is very cool.
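
For reference, the adjustments are along these lines - treat the numbers as illustrative rather than our exact production values:

    ceph config set mon mon_lease 10         # up from the 5s default; more breathing room per lease round
    ceph config set mgr mgr_stats_period 15  # up from the default; fewer stats rounds for the mgr to keep up with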

>
>> we've been noticing it takes significantly longer for brand-new OSDs to go from booting to active when the cluster has been in a state of flux for some time. It can take over an hour for a newly created OSD to be marked up in some cases!
>
>That’s pretty extreme.
>
>> We've just put up with it for some time, but I finally got annoyed enough with it to look into it today...
>>
>> Looking at the logs of a new OSD when it's starting:
>>
>> 2025-01-07T13:44:05.534+0000 7f0b8b830700  3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
>> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638  msg say newest map is 5175990, requesting more
>> 2025-01-07T13:44:08.990+0000 7f0b8b830700  3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
>> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678  msg say newest map is 5175990, requesting more
>> 2025-01-07T13:44:12.394+0000 7f0b8b830700  3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
>> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718  msg say newest map is 5175990, requesting more
>>
>> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time. With the ~30,000(!) OSD maps it pulls down, it takes approximately an hour. At ~4MB a map, this then matches up with the ~115GB storage consumed by the resulting OSD with no PGs.
>>
>> I realise the obvious answer here is don't leave a big cluster in an unclean state for this long.
>
>Patient:  It turns when I do this
>Doctor:  Don’t do that.

:D
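
Incidentally, a quick back-of-envelope check of the numbers quoted above (plain shell arithmetic, nothing cluster-specific):

    # ~30,000 epochs fetched 40 at a time, ~4s per batch, ~4MB per map
    echo "$(( 30000 / 40 * 4 / 60 )) minutes of map fetching"   # ~50 minutes
    echo "~$(( 30000 * 4 / 1024 )) GB of osdmaps on disk"       # ~117 GB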

>
>> Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly.
>
>Are you adding OSDs while the cluster is converging / backfilling / recovering?

Yes - in general the cluster is always doing something; we have a fairly tight schedule of 'cluster time' (as we call it) for hardware addition/removal, rolling reboots and patching, letting the balancer catch up, etc. Although we try to schedule quiet periods before larger interventions, hardware problems often don't cooperate with this!

>> This is something we're always looking at from a procedure point of view, e.g. keeping max_backfills as high as possible by default, ensuring the balancer's max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand if we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.
>
>I might speculate that putting off OSD addition until the cluster is converged might help, perhaps followed by a rolling mon compaction in advance of adding OSDs.

This makes sense. We treat OSD re-additions after drive replacements as an ongoing process that we don't schedule around the larger operations, and in fact we've been a lot more proactive about getting drives back in recently. Perhaps this is actually hindering us here, and a batched approach to OSD re-addition would suit us better.
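
On the knobs mentioned above, for anyone following along, this is roughly the shape of it (values are illustrative, not a recommendation):

    ceph config set osd osd_max_backfills 3               # 'as high as possible' is relative; the pacific default is 1
    ceph config set mgr target_max_misplaced_ratio 0.05   # the balancer's max_misplaced fraction (default 5%)
    ceph config get osd osd_max_backfills                 # confirm what the cluster is actually running with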

>
>> The first thing I noted was that the OSD block devices aren't busy during the OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I started looking into raising 'osd_map_share_max_epochs' to hopefully increase the number of maps shared with the new OSD per request and improve the rate, but I balked a bit after realising I would have to do this across the whole cluster (I think, anyway; I'm not actually sure where the maps are being pulled from at this point).
>
>From the mons I think.
>
>> All the tuning advice I could find for this value talked about reducing it, which scared me further.
>>
>> Additionally, there's clearly some interplay between 'osd_map_cache_size' and 'osd_map_message_max' to consider. These historic maps must be being pulled from disk in general (be it osd or mon),
>
>Are your mon DBs on spinners?

They're on dedicated SSDs. If the maps are being pulled from the mons for the initial boot, I'm a little less keen to try and fiddle with this; DoSing the mons when you add a few hosts' worth of OSDs sounds like something to avoid.
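
For reference, this is how the relevant options can be cross-checked per daemon (osd.2016 being the OSD from the log above; no claim that these are the right values to change):

    ceph config get osd osd_map_share_max_epochs
    ceph config get osd osd_map_cache_size
    ceph config get osd osd_map_message_max
    # run on the OSD's host, via the admin socket:
    ceph daemon osd.2016 config show | egrep 'osd_map_(share_max_epochs|cache_size|message_max)'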

>
>> so it shouldn't make a difference if osd_map_share_max_epochs > osd_map_cache_size, but in general I suppose you don't want OSDs having to grab maps off disk for requests from peers?
>>
>> (There may also be a completely different dominating factor of the time to download and store the maps that I'm not considering here.)
>>
>> So, any advice on improving the speed of the OSDmap download for fresh OSDs would be appreciated, or any other thoughts about this situation.
>>
>> Thanks,
>> Tom
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



