Re: What is the problem with many PGs per OSD

> but simply on the physical parameter of IOPS-per-TB (a "figure of merit" that
> is widely underestimated or ignored)

hear hear!

> of HDDs, and having enough IOPS-per-TB to sustain both user and admin workload.

Even with SATA SSDs I have twice had to expand a cluster to meet its SLO long before it was anywhere near full.  The SNIA TCO calculator includes a multiplier for the number of drives one has to provision to get semi-acceptable IOPS.
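
Back-of-envelope, the figure of merit is trivial to compute; the drive numbers below are assumed ballpark values, not measurements from any particular cluster:

    # Rough IOPS-per-TB comparison; all drive figures here are assumptions/ballpark.
    drives = {
        "18TB 7200rpm HDD": {"capacity_tb": 18,   "iops": 100},    # ~100 random 4K IOPS is typical for a 7.2k spinner
        "8TB 7200rpm HDD":  {"capacity_tb": 8,    "iops": 100},
        "7.68TB SATA SSD":  {"capacity_tb": 7.68, "iops": 50000},  # conservative sustained mixed IOPS
    }

    for name, d in drives.items():
        print(f"{name}: {d['iops'] / d['capacity_tb']:.1f} IOPS per TB")

The big spinner lands around 5-6 IOPS per TB; even SATA flash comes out roughly three orders of magnitude better.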

> A couple of legacy Ceph instances I saw in the past had 8TB and
> 18TB drives and as they got full the instances basically
> congealed (latencies in the several seconds or even dozens of
> second range) even under modest user workloads, and anyhow
> expensive admin workloads like scrubbing (never mind deep
> scrubbing) got behind by a year or two, and rebalancing was
> nearly impossible. Again not because of Ceph.

Been there, ITSY’d.  Fragmentation matters with rotational media, even with op re-ordering within the drive or the driver.
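
The scrub backlog is simple napkin math too; the bandwidth figures below are assumptions (a busy spinner rarely has more than a few tens of MB/s to spare for it):

    # Time for one full pass over an 18TB HDD at assumed scrub rates.
    # Real rates depend on client load and the osd_scrub_* / osd_deep_scrub_* settings.
    capacity_bytes = 18e12

    for scrub_mb_s in (10, 25, 50):
        days = capacity_bytes / (scrub_mb_s * 1e6) / 86400
        print(f"{scrub_mb_s} MB/s spare for scrubbing -> {days:.1f} days per deep scrub of the drive")

At 10 MB/s that is roughly three weeks per pass, which is how a backlog grows to a year or two.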

> But that is completely different: SSDs have *much* higher IOPS,
> even SATA ones, so even large SSDs have enormously better
> IOPS-per-TB.

And IOPS-per-yourlocalcurrency. Coarse-IU (indirection unit) QLC is a bit of a wrinkle depending on the workload...

>> I would like to point out that there are scale-out storage
>> systems that have adopted their architecture for this scenario
>> and use large HDDs very well.
> 
> That is *physically impossible* as they just do not have enough
> IOPS-per-TB for many "live" workloads. The illusion that they
> might work well happens in one of two cases:
> 
> * Either because they have not filled up yet,

I saw this with RGW on ultradense HDD toploaders.  

> or because they
>  have filled up but only a minuscule subset of the data is in
>  active use, so the IOPS-per-*active*-TB of the user workload is
>  still good enough.

Archival workloads - sure.  Sometimes even backups.  Even then, prudently-sourced QLC often has superior TCO compared to spinners.
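
The "minuscule active subset" case is the same figure of merit with a smaller denominator; a quick sketch with assumed drive counts, IOPS, and working-set fractions:

    # IOPS per *active* TB: same arithmetic, divided by the hot working set only.
    # Drive count, per-drive IOPS, and working-set fractions are assumptions for illustration.
    total_tb   = 12 * 18     # e.g. a host with 12 x 18TB HDDs
    total_iops = 12 * 100    # ~100 random IOPS per spinner

    for active_fraction in (1.0, 0.10, 0.01):
        active_tb = total_tb * active_fraction
        print(f"active set {active_fraction:.0%}: {total_iops / active_tb:.1f} IOPS per active TB")

If only 1% of the data is ever touched, the per-active-TB figure looks healthy right up until the access pattern changes.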

> * If the *active data* is mostly read-only and gets cached on a
>  SSD tier of sufficient size, and admin workload does not
>  matter.

And sometimes that data is only active because of full backups, in which case the backup pass effectively flushes the cache to boot.

> I have some idea of how Qumulo does things and that is very
> unlikely; Ceph is not fundamentally inferior to their design.
> Perhaps the workload's anisotropy matches particularly well that
> of that particular Qumulo instance:

Like a DB that’s column-oriented vs row-oriented?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



