> [... number of PGs per OSD ...]
> So it is mainly related to PG size.

Indeed, and secondarily to the number of objects: many objects per
PG mean lower metadata overhead, but bigger PGs mean higher admin
workload latency.

>> Note: HDDs larger than 1TB are not really suitable for
>> significant parallel user workloads and most admin workloads:
>> http://www.sabi.co.uk/blog/17-one.html?170610#170610
>> "How challenging is a goal of 18MB/s per TB of storage, and latency"

> I'm afraid nobody will build a 100PB cluster with 1T drives.

Well, if they want it to be able to do a non-negligible user
workload plus the necessary admin workload, too bad for them.

My post above does not depend at all on the filesystem, but simply
on the physical parameter of IOPS-per-TB of HDDs (a "figure of
merit" that is widely underestimated or ignored), and on having
enough IOPS-per-TB to sustain both the user and the admin workload.

A couple of legacy Ceph instances I saw in the past had 8TB and
18TB drives, and as they got full the instances basically congealed
(latencies in the several-seconds or even dozens-of-seconds range)
even under modest user workloads; expensive admin workloads like
scrubbing (never mind deep scrubbing) fell behind by a year or two,
and rebalancing was nearly impossible. Again, not because of Ceph.

> That's just absurd. So, the sharp increase of per-device
> capacity has to be taken into account.

Indeed, by taking into account that HDDs of 4TB and above behave as
slow-random-access tapes.

> Specifically as the same development is happening with SSDs.

But that is completely different: SSDs have *much* higher IOPS,
even SATA ones, so even large SSDs have enormously better
IOPS-per-TB.

> I would like to point out that there are scale-out storage
> systems that have adapted their architecture for this scenario
> and use large HDDs very well.

That is *physically impossible*, as they just do not have enough
IOPS-per-TB for many "live" workloads. The illusion that they might
work well happens in one of two cases:

* Either because they have not filled up yet, or because they have
  filled up but only a minuscule subset of the data is in active
  use, the IOPS-per-*active*-TB of the user workload is still good
  enough. The problem with that is that for many admin workloads a
  lot of the data, or even *all* of the data, becomes active.

* If the *active* data is mostly read-only and gets cached on an
  SSD tier of sufficient size, and the admin workload does not
  matter.

> For comparison, our University operates an all-HDD Qumulo
> cluster that handles the administrative and user/student
> storage for the entire University and has about a factor 10
> higher aggregated sustained IOP/s performance compared with a
> similarly sized ceph cluster (FS performance).

I have some idea of how Qumulo does things and that is very
unlikely: Ceph is not fundamentally inferior to their design.
Perhaps the workload's anisotropy matches particularly well that of
that specific Qumulo instance:

https://www.sabi.co.uk/blog/16-one.html?160322#160322
"Examples of anisotropy in Isilon and Qumulo"

> doing something "good enough" with large drives is very well
> possible.

It is possible to use large drives for "Glacier" style storage with
an SSD front tier. But "Glacier" style storage is not trivial to
design or manage:

https://storagemojo.com/2014/04/25/amazons-glacier-secret-bdxl/
https://storagemojo.com/2014/04/30/glacier-redux/
https://blog.dshr.org/2014/09/more-on-facebooks-cold-storage.html
https://www.theregister.com/2015/05/07/facebook_maid_gets_cold/
https://blog.dshr.org/2021/05/storage-update.html
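To put a number on the IOPS-per-TB "figure of merit" mentioned
above, here is a rough back-of-the-envelope sketch in Python; the
~150 random IOPS per spindle and the tens of thousands of IOPS for
a SATA SSD are ballpark assumptions of mine, not measurements of
any particular product:

  # Back-of-the-envelope IOPS-per-TB for a few drive types.
  # The per-device IOPS figures are ballpark assumptions: a 7200rpm
  # HDD delivers roughly the same ~150 random IOPS whatever its
  # capacity, while even a SATA SSD delivers tens of thousands.
  drives = [
      # (label, capacity in TB, assumed random IOPS per device)
      ("1TB 7200rpm HDD",   1,    150),
      ("4TB 7200rpm HDD",   4,    150),
      ("8TB 7200rpm HDD",   8,    150),
      ("18TB 7200rpm HDD", 18,    150),
      ("4TB SATA SSD",      4, 50_000),
  ]
  for label, tb, iops in drives:
      # Per-device HDD IOPS barely change with capacity, so the
      # figure of merit collapses as drives get bigger.
      print(f"{label:17s} {iops / tb:8.1f} IOPS-per-TB")

Per-device HDD IOPS are set by seek and rotational latency and
barely change with capacity, so the figure of merit falls roughly
linearly with drive size; that is the entire argument in one
number.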
> - You also don't know why the recommendation today is 100-200
> per OSD fixed except that it was suitable for 1TB drives.
> - You also can't answer what will happen if one goes for
> 100-200 PGs per TB, meaning 1600-3200 PGs per 16TB drive.
> So my main question, the last one, is still looking for an answer.

The famous joke goes: a guy goes to the doctor and says "Doctor,
when I stab my hand hard with a fork it really hurts a lot, how can
you fix that?" and the doctor says "Do not do that". If you do not
like that answer, keep looking :-).

Put another way: if the combined user+admin workload requires N*100
IOPS-per-TB, using N IOPS-per-TB storage media is not going to give
a happy experience. But lots of people know better.
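To illustrate the arithmetic behind that last point for a 16TB OSD,
a small sketch; the weekly deep-scrub target, the ~200MB/s
best-case sequential rate and the ~150 random IOPS are ballpark
assumptions of mine, not properties of any specific drive or Ceph
release:

  # Rough arithmetic for the admin workload on one nearly-full,
  # large OSD; all figures are ballpark assumptions for a 7200rpm HDD.
  capacity_tb     = 16       # nearly-full 16TB OSD
  scrub_days      = 7        # target: deep-scrub everything weekly
  hdd_seq_mb_s    = 200      # best-case sequential throughput
  hdd_random_iops = 150      # per-device random IOPS

  # At 100-200 PGs per TB, as asked above:
  print(f"PGs on this OSD: {100 * capacity_tb}-{200 * capacity_tb}")

  # Sustained rate needed just to deep-scrub the whole drive in the window.
  needed_mb_s = capacity_tb * 1e12 / (scrub_days * 24 * 3600) / 1e6
  print(f"deep scrub alone needs ~{needed_mb_s:.0f} MB/s sustained, "
        f"~{100 * needed_mb_s / hdd_seq_mb_s:.0f}% of best-case "
        f"sequential bandwidth, before any user IO")

  # And the IOPS-per-TB available for user plus remaining admin work.
  print(f"IOPS-per-TB of the device: ~{hdd_random_iops / capacity_tb:.1f}")

And deep scrub over thousands of small PGs on a fragmented,
nearly-full drive is far from a single sequential pass, so in
practice much of that cost turns into random IOPS competing with
the user workload, which is exactly the IOPS-per-TB squeeze
described above.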