Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Tom,

Great talk there!

Since your cluster must be one of the largest in the world, it would be great if you could share your experience with the community as a case study [1]. The Ceph project is looking for contributors right now.
If you're interested, let me know and we'll see how we can organize that.

I couldn't find how many MONs you're running in that big cluster. Hopefully 5 MONs.
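
A quick way to check is:

  ceph mon stat

which prints the monitor count and current quorum.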

You said OSDs have collocated WAL/DBs on HDDs. Have you tried running OSDs with WAL/DBs on NVMes?

I'm wondering how WAL/DBs collocated on HDDs affect OSD creation time, OSD startup time, peering and osdmap updates, and what role they might play in flapping when DB IOs compete with client IOs, even with 100% active+clean PGs.
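
For comparison, moving the DB off the HDD is just a matter of pointing
ceph-volume at a separate device at OSD creation time, along these lines
(device paths here are placeholders, adjust to your hardware):

  ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1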

Cheers,
Frédéric.

[1] https://ceph.io/en/discover/case-studies/

----- On 8 Jan 25, at 16:10, Thomas Byrne, STFC UKRI tom.byrne@xxxxxxxxxx wrote:

> Hi Frédéric,
> 
> All of our recent OSD crashes can be attributed to genuine hardware issues (i.e.
> failed IO due to unreadable sectors). For reference, I've had a look and it
> looks like we've had a handful of drive failures on this cluster in the past
> month, with no other significant flapping. What I was trying to say is that it
> doesn't take many drive failures, combined with the balancer running, to
> result in a persistent level of OSDMap churn.
> 
> Storage nodes are all some derivative of a 24-bay, 2U chassis (e.g. 760XD2)
> Single 25Gig connection, no jumbo frames
> HDDs range from 12-20TB SAS depending on year purchased, with collocated
> WAL/DBs on the HDDs.
> All BlueStore OSDs
> Mons have dedicated flash devices for their stores
> 
> The workload is radosstriper access to EC pools, so very limited metadata
> requirements (hence the lack of flash for OSDs). More detail on the workload
> can be found in a very old talk of mine from a Ceph Day [1].
> 
> [1] https://indico.cern.ch/event/765214/contributions/3517140/
> 
> ________________________________________
> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Wednesday, January 8, 2025 12:59
> To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
> Cc: Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
> Subject: Re:  Re: Slow initial boot of OSDs in large cluster with
> unclean state
> 
> 
> Hi Tom,
> 
> Could you describe this cluster from a hardware perspective? Network speed and
> MTU size, HDD type and capacity, whether OSDs have their WAL/DB on SSD/NVMe or
> if they're collocated, whether MONs are using HDDs or SSDs/NVMe, what workloads
> this cluster is handling?
> 
> You mentioned OSD flapping. This should no longer occur on any cluster today,
> or only very rarely, in cases of actual hardware failure or when hardware is
> undersized relative to the workload. All your OSDs are using BlueStore,
> correct?
> 
> Regards,
> Frédéric.
> 
> ----- On 8 Jan 25, at 12:29, Thomas Byrne - STFC UKRI tom.byrne@xxxxxxxxxx
> wrote:
> 
>> Hi Wes,
>>
>> It works out at about five new osdmaps a minute, which is about normal for this
>> cluster's state changes as far as I can tell. It'll drop down to 2-3
>> maps/minute during quiet periods, but the combination of the upmap balancer
>> making changes and occasional OSD flaps or crashes due to hardware issues is
>> enough to cause a fairly reliable rate of osdmap churn.
>>
>> This churn is something that we are working on understanding, and reducing
>> where possible, now that we know it's become a pain point for us.
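>>
>> For anyone wanting to watch this on their own cluster, the monitors' currently
>> retained osdmap range shows up in 'ceph report', e.g.
>>
>>   ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
>>
>> and the gap between the two is roughly the backlog a fresh OSD would have to
>> fetch.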
>>
>> Thanks,
>> Tom
>>
>> ________________________________
>> From: Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
>> Sent: Tuesday, January 7, 2025 18:41
>> To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
>> Subject: Re:  Slow initial boot of OSDs in large cluster with
>> unclean state
>>
>> It went from the normal osdmap range of 500-1000 maps to 30,000 maps in 5
>> days? That seems like excessive accumulation to me for a 5-day period.
>>
>> Respectfully,
>>
>> Wes Dillingham
>> LinkedIn: http://www.linkedin.com/in/wesleydillingham
>> wes@xxxxxxxxxxxxxxxxx
>>
>>
>>
>>
>> On Tue, Jan 7, 2025 at 1:18 PM Thomas Byrne - STFC UKRI
>> <tom.byrne@xxxxxxxxxx> wrote:
>> Hi all,
>>
>> On our 6000+ HDD OSD cluster (Pacific), we've noticed it takes significantly
>> longer for brand new OSDs to go from booting to active when the cluster has
>> been in a state of flux for some time. It can take over an hour for a newly
>> created OSD to be marked up in some cases! We've just put up with it for some
>> time, but I finally got annoyed enough to look into it today...
>>
>> Looking at the logs of a new OSD when it's starting:
>>
>> 2025-01-07T13:44:05.534+0000 7f0b8b830700  3 osd.2016 5165598 handle_osd_map
>> epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
>> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638  msg say newest
>> map is 5175990, requesting more
>> 2025-01-07T13:44:08.990+0000 7f0b8b830700  3 osd.2016 5165638 handle_osd_map
>> epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
>> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678  msg say newest
>> map is 5175990, requesting more
>> 2025-01-07T13:44:12.394+0000 7f0b8b830700  3 osd.2016 5165678 handle_osd_map
>> epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
>> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718  msg say newest
>> map is 5175990, requesting more
>>
>> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time. With
>> the ~30,000(!) OSD maps it pulls down, it takes approximately an hour. At ~4MB
>> a map, this then matches up with the ~115GB storage consumed by the resulting
>> OSD with no PGs.
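>>
>> Spelling out the arithmetic: ~30,000 maps at 40 maps per request is ~750
>> requests; at ~4 seconds each that's ~3,000 seconds, i.e. roughly 50 minutes,
>> and 30,000 x ~4MB is ~120GB, both of which line up with what we're seeing.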
>>
>> I realise the obvious answer here is: don't leave a big cluster in an unclean
>> state for this long. Currently we've got PGs that have been remapped for 5
>> days, which matches the 30,000 OSDMap epoch range perfectly. This is something
>> we're always looking at from a procedural point of view, e.g. keeping
>> max_backfills as high as possible by default, ensuring the balancer's
>> max_misplaced is appropriate, and re-evaluating disk and node addition/removal
>> processes. But the reality on this cluster is that sometimes these 'logjams'
>> happen, and it would be good to understand whether we can improve the OSD
>> addition experience so we can continue to be flexible with our operation
>> scheduling.
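>>
>> For reference, on recent releases the knobs in question are roughly:
>>
>>   ceph config set osd osd_max_backfills <n>
>>   ceph config set mgr target_max_misplaced_ratio <ratio>
>>
>> though the right values are very cluster-dependent.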
>>
>> The first thing I noted was that the OSD block devices aren't busy during the
>> OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I started
>> looking into raising 'osd_map_share_max_epochs' to hopefully increase the
>> number of maps shared with the new OSD per request and improve the rate, but I
>> balked a bit after realising I would have to change it across the whole
>> cluster (I think, anyway - I'm not actually sure where the maps are being
>> pulled from at this point). All the tuning discussion of this value I could
>> find talked about reducing it, which scared me further.
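>>
>> For the record, checking and overriding it cluster-wide would look something
>> like:
>>
>>   ceph config get osd osd_map_share_max_epochs
>>   ceph config set osd osd_map_share_max_epochs <n>
>>
>> but as above, I'm not convinced raising it is actually safe.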
>>
>> Additionally, there's clearly some interplay between 'osd_map_cache_size' and
>> 'osd_map_message_max' to consider. These historic maps must generally be
>> pulled from disk (be it OSD or mon), so it shouldn't make a difference if
>> osd_map_share_max_epochs > osd_map_cache_size, but in general I suppose you
>> don't want OSDs having to grab maps off disk for requests from peers?
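>>
>> (As an aside, the 40-maps-per-message in the log above matches what I believe
>> is the default of osd_map_message_max, i.e. 40; 'ceph config help
>> osd_map_message_max' will show the local default and description.)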
>>
>> (There may also be a completely different dominating factor of the time to
>> download and store the maps that I'm not considering here.)
>>
>> So, any advice on improving the speed of the OSDmap download for fresh OSDs
>> would be appreciated, or any other thoughts about this situation.
>>
>> Thanks,
>> Tom
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



