It went from the normal osdmap range of 500-1000 maps to 30,000 maps in 5
days? That seems like excessive accumulation to me for a 5-day period.

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx


On Tue, Jan 7, 2025 at 1:18 PM Thomas Byrne - STFC UKRI <tom.byrne@xxxxxxxxxx> wrote:

> Hi all,
>
> On our 6000+ HDD OSD cluster (Pacific), we've been noticing that it takes
> significantly longer for brand new OSDs to go from booting to active when
> the cluster has been in a state of flux for some time. It can take over
> an hour for a newly created OSD to be marked up in some cases! We've just
> put up with it for a while, but I finally got annoyed enough with it to
> look into it today...
>
> Looking at the logs of a new OSD when it's starting:
>
> 2025-01-07T13:44:05.534+0000 7f0b8b830700 3 osd.2016 5165598 handle_osd_map epochs [5165599,5165638], i have 5165598, src has [5146718,5175990]
> 2025-01-07T13:44:08.988+0000 7f0b8d6ed700 10 osd.2016 5165638 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:08.990+0000 7f0b8b830700 3 osd.2016 5165638 handle_osd_map epochs [5165639,5165678], i have 5165638, src has [5146718,5175990]
> 2025-01-07T13:44:12.391+0000 7f0b8d6ed700 10 osd.2016 5165678 msg say newest map is 5175990, requesting more
> 2025-01-07T13:44:12.394+0000 7f0b8b830700 3 osd.2016 5165678 handle_osd_map epochs [5165679,5165718], i have 5165678, src has [5146718,5175990]
> 2025-01-07T13:44:16.047+0000 7f0b8d6ed700 10 osd.2016 5165718 msg say newest map is 5175990, requesting more
>
> It's pulling down OSD maps 40 at a time, taking about 4 seconds per
> batch. With the ~30,000(!) OSD maps it pulls down, that works out to
> approximately an hour. At ~4MB a map, it also matches the ~115GB of
> storage consumed by the resulting OSD with no PGs.
>
> I realise the obvious answer here is "don't leave a big cluster in an
> unclean state for this long". Currently we've got PGs that have been
> remapped for 5 days, which matches the 30,000 OSDMap epoch range
> perfectly. This is something we're always looking at from a procedural
> point of view, e.g. keeping max_backfills as high as possible by default,
> ensuring the balancer's max_misplaced is appropriate, and re-evaluating
> disk and node addition/removal processes. But the reality on this cluster
> is that sometimes these 'logjams' happen, and it would be good to
> understand whether we can improve the OSD addition experience so we can
> continue to be flexible with our operation scheduling.
>
> The first thing I noted was that the OSD block devices aren't busy during
> the OSDmap fetching process - they're barely doing 50MB/s and 50 wr/s. I
> started looking into raising 'osd_map_share_max_epochs' to increase the
> number of maps shared with the new OSD per request and hopefully improve
> the rate, but I balked a bit after realising I would have to do this
> across the whole cluster (I think, anyway - I'm not actually sure where
> the maps are being pulled from at this point). All the tuning advice I
> could find for this value talked about reducing it, which scared me
> further.
>
> Additionally, there's clearly some interplay between 'osd_map_cache_size'
> and 'osd_map_message_max' to consider. These historic maps must generally
> be pulled from disk (be it OSD or mon), so it shouldn't make a difference
> if osd_map_share_max_epochs > osd_map_cache_size, but in general I
> suppose you don't want OSDs having to grab maps off disk for requests
> from peers?
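>
> In case it helps frame the question: 30,000 epochs at 40 per fetch is
> ~750 round trips, and at ~4 s each that's ~50 minutes, which lines up
> with what we see. If raising the batch size does turn out to be the right
> lever, I assume the cluster-wide change would look something like the
> following (untested sketch on my part, and the 400 is only an arbitrary
> illustration, not a recommendation):
>
>     # current values (40 per batch, matching the log excerpt above)
>     ceph config get osd osd_map_share_max_epochs
>     ceph config get osd osd_map_message_max
>
>     # tentatively let each request/message carry more epochs
>     ceph config set osd osd_map_share_max_epochs 400
>     ceph config set osd osd_map_message_max 400
>
> Though as I said, I'm not sure whether the maps are coming from the mons
> or from peer OSDs, so I don't know if the osd section is even the right
> place to change it.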
>
> (There may also be a completely different dominating factor in the time
> to download and store the maps that I'm not considering here.)
>
> So, any advice on improving the speed of the OSDmap download for fresh
> OSDs would be appreciated, as would any other thoughts about this
> situation.
>
> Thanks,
> Tom
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx