Re: What is the problem with many PGs per OSD

Hi Peter,

thanks for your comment. So it is mainly a matter of PG size. Unfortunately, we need a reality check here:

> It was good for the intended use case, lots of small (by today's
> standards, around 1TB) OSDs on many servers working in parallel.
>
> Note: HDDs larger than 1TB are not really suitable for
> significant parallel user workloads and most admin workloads:
> http://www.sabi.co.uk/blog/17-one.html?170610#170610

I'm afraid nobody will build a 100PB cluster with 1TB drives. That's just absurd. So the sharp increase in per-device capacity has to be taken into account, especially as the same development is happening with SSDs. There is no way around 100TB drives in the near future, and a system like Ceph will either be able to handle that or die. I would like to point out that there are scale-out storage systems that have adapted their architecture to this scenario and use large HDDs very well.

For comparison, our university operates an all-HDD Qumulo cluster that handles the administrative and user/student storage for the entire university and delivers about a factor of 10 higher aggregate sustained IOPS than a similarly sized Ceph cluster (FS performance). It uses the same EC redundancy and 16TB HDDs, keeps hundreds of snapshots on the file system, and mirrors to a secondary site, all basically on the same server configs we use for Ceph. My best guess is that some tiering between replicated front-end and EC back-end storage is going on internally. But who knows, it's closed source. Either way, it shows that doing something "good enough" with large drives is very well possible.

The main problem behind the high admin workload in Ceph is that user data is mapped directly onto RADOS-level objects. That may have been a good idea back in the day, but it has become a heavy legacy now, because all admin operations happen on the many small user objects instead of on aggregated objects that are much easier to operate on.
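To make the aggregation idea concrete, here is a minimal, purely illustrative sketch (not how Ceph or any named system actually does it; the 64 MiB container size and the file sizes are assumptions) of packing many small user files into fewer, larger container objects, so that admin operations touch far fewer units:

```python
# Illustrative only: greedy packing of small files into large
# fixed-size "container" objects. All names and sizes are hypothetical.

AGG_SIZE = 64 * 2**20  # assumed 64 MiB container objects

def pack(files):
    """Pack (name, size) pairs greedily into container objects."""
    containers, cur, cur_size = [], [], 0
    for name, size in files:
        if cur and cur_size + size > AGG_SIZE:
            containers.append(cur)   # container full, start a new one
            cur, cur_size = [], 0
        cur.append(name)
        cur_size += size
    if cur:
        containers.append(cur)
    return containers

# 10,000 small 100 KiB files become 16 container objects instead of
# 10,000 RADOS-level objects for scrub/rebalance to walk over.
files = [(f"f{i}", 100 * 1024) for i in range(10000)]
print(len(pack(files)))  # -> 16
```

The point of the sketch is only the ratio: the admin workload then scales with the number of containers, not with the number of user files.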

For now I understand your comment in summary as:

- Yes, 200G PGs are insane, they should be a lot smaller.
- For large PGs the metadata workload is actually higher than for small PGs (roughly what I argue as well),
  and with small PGs it is more efficient to rebuild and keep track of redundancy.
- You also don't know why today's recommendation is a fixed 100-200 PGs per OSD, except that it was suitable for 1TB drives.
- You also can't say what will happen if one goes for 100-200 PGs per TB, meaning 1600-3200 PGs per 16TB drive.

So my main question, the last one, is still looking for an answer.
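For concreteness, the arithmetic behind that question can be sketched as follows (the 100/200-per-TB rule is the proposal from this thread; everything else is simple capacity arithmetic, ignoring EC overhead and pool imbalance):

```python
# Sketch of the PG sizing arithmetic discussed above. Figures are
# illustrative; real clusters must also respect per-pool pg_num limits.

def pgs_per_osd(osd_capacity_tb, pgs_per_tb):
    """PG count an OSD would carry if the count scaled with capacity."""
    return osd_capacity_tb * pgs_per_tb

def avg_pg_size_gb(osd_capacity_tb, pgs_on_osd):
    """Average data per PG on one full OSD, ignoring EC overhead."""
    return osd_capacity_tb * 1024 / pgs_on_osd

# Fixed per-OSD recommendation on a 16 TB drive: very large PGs.
print(avg_pg_size_gb(16, 100))   # -> 163.84 GB per PG
# Capacity-proportional rule: 100 PGs/TB on the same drive.
print(pgs_per_osd(16, 100))      # -> 1600 PGs
print(avg_pg_size_gb(16, 1600))  # -> 10.24 GB per PG
```

So the open question is precisely whether an OSD daemon copes with carrying 1600-3200 PGs, in exchange for PGs that shrink from ~160 GB to ~10 GB each.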

Thanks for your comment and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx>
Sent: Thursday, October 10, 2024 1:01 PM
To: list Linux fs Ceph
Subject:  Re: What is the problem with many PGs per OSD

>>> On Thu, 10 Oct 2024 08:53:08 +0000, Frank Schilder <frans@xxxxxx> said:

> The guidelines are *not* good enough for EC pools on large
> HDDs that store a high percentage of small objects, in our
> case, files.

Arguably *nothing* is good enough for that, because it is the
worst possible case scenario (A Ceph instance I inherited was
like that). Ceph was designed to have large swarms of small
OSSes with a 1 (or at most a few) small devices each. Having
lots of small OSDs on many servers is a critical assumption of
many aspects of Ceph (even if the opposite can be made to
"workish").

This is very right though:

> In fact, they are really bad in that case and there were a
> number of recent ceph-user threads where a significant
> increase in PG count would probably help a lot.

> Including problems caused by very high object count per PG I'm
> dancing around on our cluster.

The main problem is not so much high object count per PG, it is
that large PGs (in the legacy case above around 200GB) have to
be "rebalanced" *in their entirety*, or go "damaged" again as a
whole, impacting enormous numbers of objects. This happens
whether their pool/profile is EC or replicas.

This point is part of my general observation that all systems
have both a user and a background admin workload (which for
storage tends to be whole-instance scans), and many are only
sized to sustain the user workload, and Ceph as self-healing and
self-balancing has a particularly high admin workload (resync,
rebalance, scrub, "backup", ...).

> For the guessing where the recommendation comes from, I'm
> actually leaning towards the "PGs should be limited in size"
> explanation. The recommendation of 100 PGs per OSD was good
> enough for a very long time

It was good for the intended use case, lots of small (by today's
standards, around 1TB) OSDs on many servers working in parallel.

Note: HDDs larger than 1TB are not really suitable for
significant parallel user workloads and most admin workloads:
http://www.sabi.co.uk/blog/17-one.html?170610#170610

> PGs were originally invented to chunk up large disks for
> distributed RAID. To keep all-to-all rebuild time constant
> independent of the scale of the cluster.

That I guess was a secondary goal: the main goal I think was to
reduce the size of metadata keeping track of redundancy (whether
replicas or EC) from every-object-shard to every-PG, from a one
level table to a two-level (list of PGs, list of object shards
in a PG) table.
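That two-level bookkeeping can be sketched as follows. This is a deliberately simplified model, not Ceph's actual data structures or placement algorithm (real placement goes through CRUSH, not a plain hash):

```python
# Minimal sketch of the two-level table described above: redundancy
# state is tracked once per PG (first level), and each PG lists the
# object shards it contains (second level). Purely illustrative.

from dataclasses import dataclass, field

@dataclass
class PG:
    pg_id: int
    objects: list = field(default_factory=list)  # second level
    clean: bool = True  # redundancy state kept per PG, not per object

@dataclass
class Pool:
    pgs: dict = field(default_factory=dict)  # first level: one entry per PG

    def place(self, obj_name, pg_count):
        """Map an object to a PG by hashing its name (stand-in for CRUSH)."""
        pg_id = hash(obj_name) % pg_count
        self.pgs.setdefault(pg_id, PG(pg_id)).objects.append(obj_name)
        return pg_id

pool = Pool()
for name in ("a", "b", "c", "d"):
    pool.place(name, pg_count=4)
# Per-pool redundancy metadata scales with len(pool.pgs),
# not with the total object count.
```

The trade-off Peter describes falls out of this shape: for a fixed object count, fewer PGs means shorter first-level metadata but more objects dragged along whenever one PG entry goes dirty.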

The choice is then to have a larger number of PGs or a larger
number of objects per PG as the total number of objects
increases, and I too prefer the larger number of PGs.

> So I question here the anecdotal reports about the PG count
> being to blame alone. There have been a number of bugs
> discovered that were triggered by PG splitting.

I remember the legacy (Pacific) Ceph instance that I had
inherited did have lots of problems because of that.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



