----- On 23 Jan 25, at 9:51, Gregory Orange gregory.orange@xxxxxxxxxxxxx wrote:

> I'm watching this thread with interest, for a few reasons. We have
> benefited a lot(!) from advice from various people in it over the years,
> there are some similarities between our setup and STFC's, and we haven't
> been bothered by this issue so far. So, I'm intrigued.
>
> On 8/1/25 23:10, Thomas Byrne - STFC UKRI wrote:
>> Storage nodes are all some derivative of a 24-bay, 2U chassis (e.g. 760XD2)
>> Single 25Gig connection, no jumbo frames
>> HDDs range from 12-20TB SAS HDDs depending on year purchased, with collocated
>> WAL/DBs on the HDDs.
>> All BlueStore OSDs
>> Mons have dedicated flash devices for their stores
>
> Our nodes are 740XD2 with 24x 16TB HDDs plus SSD for RocksDB, and 100Gb
> with jumbo frames (9000 MTU). Quincy deb pkgs on Ubuntu 20.04, so we're
> planning for some necessary upgrade work. We have had
> target_max_misplaced_ratio at 0.3% since the last time we had a lot of
> backfilling (aided by upmap tools); that value left us with plenty of
> performance capacity for users. We're at 75-80% full with some capacity
> for growth, and it is operating fine.
>
> Sometimes starting an OSD can take up to 20 minutes,

Hey Gregory,

Were these OSDs down for a long period of time? Since BlueStore, and even
back in FileStore, I've never seen an OSD take that long to start. And we
use the same hardware as you (730xd, 740xd, 760xd2) with RocksDBs on SSDs
and/or NVMes, and HDDs of 4TB, 8TB and 16TB.

I wonder if/how this is influenced by the number of OSDs in the cluster
(we 'only' have 600 OSDs). I mean, it probably is, but to that point...
I'm surprised.

Cheers,
Frédéric.

> so there may be
> some shared experience there. However, apart from a harrowing period
> last year[1] we live in HEALTH_OK most of the time.
>
> We also don't schedule the balancer to ever be off, because it is often
> pretty quiet. Typical output from our two clusters:
>
>   pgs: 32591 active+clean
>        588   active+clean+scrubbing+deep
>        21    active+clean+scrubbing
>
> The big one has just shy of 3000 OSDs, which is half of Thomas' cluster.
> Perhaps that is a key difference. Our hardware failure rate, though, is
> markedly less than half of theirs. I think we had one in December, and
> none this year so far.
>
> [1]
> https://ceph2024.sched.com/event/1ktWK/get-that-cluster-back-online-but-hurry-slowly-gregory-orange-pawsey-supercomputing-centre
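
For reference, the misplaced-ratio throttle Gregory mentions is an mgr
option that the balancer (and pg_autoscaler) honours. A minimal sketch of
checking and lowering it with the Ceph CLI, assuming a recent release where
the option lives under the mgr section; the 0.003 value simply mirrors the
0.3% figure above:

    # show the current throttle (default is 0.05, i.e. 5% of PGs misplaced)
    ceph config get mgr target_max_misplaced_ratio

    # lower it to 0.3% so the balancer keeps fewer PGs backfilling at once
    ceph config set mgr target_max_misplaced_ratio 0.003

    # confirm the balancer is enabled and see its current mode
    ceph balancer status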
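
On the 20-minute OSD start times: two BlueStore-era knobs that can
legitimately stretch startup are an fsck on mount and a RocksDB/omap
compaction on start. A sketch of where to look, assuming default log
locations and an illustrative OSD id of 12:

    # are fsck-on-mount or deep fsck enabled? (both typically default to false)
    ceph config get osd bluestore_fsck_on_mount
    ceph config get osd bluestore_fsck_on_mount_deep

    # is an omap/RocksDB compaction triggered at every start?
    ceph config get osd osd_compact_on_start

    # check the OSD log for fsck/compaction activity during boot
    grep -iE 'fsck|compact' /var/log/ceph/ceph-osd.12.log | tail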