Re: What is the problem with many PGs per OSD

Hi Greg,

thanks for chiming in here.

> ... presumably because the current sizing guidelines are generally good enough to be getting on with ...

That's exactly why I'm bringing this up with such insistence. The guidelines are *not* good enough for EC pools on large HDDs that store a high percentage of small objects (in our case, files). In fact, they are really bad in that case, and there have been a number of recent ceph-users threads where a significant increase in PG count would probably have helped a lot, including the problems caused by the very high object count per PG that I'm dancing around on our cluster.

As for guessing where the recommendation comes from, I'm actually leaning towards the "PGs should be limited in size" explanation. The recommendation of 100 PGs per OSD was good enough for a very long time, together with the bugs and observations you mention, where it was never really assessed what the actual cause was or what resources are actually needed per PG compared with the total object count per OSD.

PGs were originally invented to chunk up large disks for distributed RAID, to keep all-to-all rebuild time constant independent of the scale of the cluster. That's how you get scale-out capability. A fixed PG count per OSD counteracts that, given the insane increase of capacity per disk we have seen lately. That's why I lean towards the recommendation having been intended to keep PGs below 5-10G each (and/or below some object count) and simply never being updated along with hardware developments.
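
To make that concrete, here's a back-of-envelope sketch (illustrative numbers only, nothing measured): with a fixed PG-per-OSD budget, the average PG size grows linearly with disk capacity, so a rule that gave ~10G PGs on 1T drives gives ~200G PGs on 20T drives.

# Back-of-envelope only: average PG size under a fixed PG-per-OSD budget.
# Capacities and PG counts below are illustrative, not measurements.
for osd_tb in (1, 4, 10, 20):          # OSD capacity in TB
    for pgs_per_osd in (100, 200):     # the usual recommendation and max
        pg_size_gb = osd_tb * 1000 / pgs_per_osd
        print(f"{osd_tb:3d} TB OSD, {pgs_per_osd:3d} PGs/OSD -> ~{pg_size_gb:.0f} GB per PG")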

I have serious trouble seeing how the PG count alone could be the single number screwing a cluster up. Peering, recovery, rocksdb size: everything is tied to the object count of an OSD. PGs just split this up into smaller units that are easier to manage. As a principle, for *any* problem with superlinear complexity (greater than linear), solving M problems of size N/M is cheaper than solving 1 problem of size N. So increasing the PG count should *improve* things on this principle alone. Unless there is a serious implementation problem, I really don't understand why anyone would claim the opposite. If there is such an implementation problem, please, anyone, come forward.
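
To illustrate that complexity argument with a toy calculation (purely schematic: f(n) = n**2 below is just a stand-in for any superlinear per-unit cost, not a model of anything in Ceph):

# Toy illustration: with a superlinear cost f(n), M pieces of size N/M are
# cheaper in total than one piece of size N. f(n) = n**2 is only a stand-in.
def cost(n):
    return n ** 2

N = 1_000_000                  # objects on one OSD (illustrative)
for M in (1, 10, 100):         # number of PGs the objects are split into
    print(f"{M:4d} PGs: total cost {M * cost(N // M):.3e}")
# prints 1.000e+12, 1.000e+11, 1.000e+10: ten times the PGs, a tenth of the work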

So I question the anecdotal reports that blame the PG count alone. A number of bugs have been discovered that were triggered by PG splitting, and that these bugs are more likely to be hit with high PG counts is kind of obvious. So it's not the PG count per se that's the problem.

Testing and experiments could be useful to update the guidelines. However, a good look at the code by a PG code maintainer would probably be faster, and if there is something problematic it would be better to refer to the code than to experiments that might have missed the critical section. So the question really is: is there a piece of code that is more than quadratic in the PG count for any resource? Worse yet, is there something exponential? If there is, there is no point in running experiments.

If there is nothing like that in the code, it's worth conducting experiments and providing a table of resource usage as a function of PG count. That would be very much appreciated.
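
For what it's worth, here is a rough sketch of the measurement loop I have in mind. This is only a sketch under assumptions: the pool name, OSD id and settle time are made up, and the exact JSON layout returned by dump_mempools may differ between releases.

# Rough sketch of the proposed experiment: step pg_num up on a test pool and
# record per-OSD mempool usage from the admin socket once things settle.
# Pool name, OSD id, settle time and JSON layout are assumptions.
import json
import subprocess
import time

POOL = "testpool"      # assumed test pool
OSD_ID = 0             # assumed OSD whose admin socket is on this host
PG_COUNTS = [64, 128, 256, 512, 1024]

def osd_mempool_bytes(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"])
    return json.loads(out)["mempool"]["total"]["bytes"]  # layout may vary by release

print("pg_num  osd_mempool_bytes")
for pg_num in PG_COUNTS:
    subprocess.check_call(
        ["ceph", "osd", "pool", "set", POOL, "pg_num", str(pg_num)])
    time.sleep(600)    # crude wait for splitting/peering to settle (assumption)
    print(f"{pg_num:6d}  {osd_mempool_bytes(OSD_ID)}")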

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: Thursday, October 10, 2024 10:19 AM
To: Frank Schilder
Cc: Janne Johansson; Anthony D'Atri; ceph-users@xxxxxxx
Subject: Re:  Re: What is the problem with many PGs per OSD

Yes, this was an old lesson, and AFAIK nobody has intentionally pushed the bounds in a long time because it was a very painful one for anybody who ran into it.

The main problem was the increase in RAM use scaling with the PG count, which in normal operation is often fine but, as we all know, balloons in failure conditions.

There are many developments that may have made things behave better, but early on some clusters just couldn’t be recovered until they received double their starting RAM and were babysat through a careful, manually orchestrated startup. (Or maybe worse; I forget.)

Nobody’s run experiments, presumably because the current sizing guidelines are generally good enough to be getting on with for anybody who has the resources to engage in the measurement work it would take to re-validate them. I will be surprised if anybody has information of the sort you seem to be searching for.
-Greg

On Thu, Oct 10, 2024 at 12:13 AM Frank Schilder <frans@xxxxxx> wrote:
Hi Janne.

> To be fair, this number could just be something vaguely related to
> "spin drives have 100-200 iops" ...

It could be, but is it? Or is it just another rumor? I simply don't see how the PG count could possibly impact the IO load on a disk.

How about this guess: it could have been dragged along from a time when HDDs were <=1T, and it simply meant keeping PGs no larger than 10G. Sounds reasonable, but is it?

I think we should really stop second-guessing here. This discussion was not meant to become a long thread where we all just guess but never know. I would appreciate it if someone who actually knows why this recommendation is really there would chime in. As far as I can tell, it could be anything or nothing. I actually tend towards "nothing": it was just never updated along with new developments, and nowadays nobody knows any more.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Janne Johansson <icepic.dz@xxxxxxxxx>
Sent: Thursday, October 10, 2024 8:51 AM
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@xxxxxxx
Subject: Re:  Re: What is the problem with many PGs per OSD

On Wed, 9 Oct 2024 at 20:48, Frank Schilder <frans@xxxxxx> wrote:

> The PG count per OSD is a striking exception. It's just a number (well, a range with 100 recommended and 200 as a max: https://docs.ceph.com/en/latest/rados/operations/pgcalc/#keyDL). It just is. And this doesn't make any sense unless there is something really evil lurking in the dark.
> For comparison, a guidance that does make sense is something like 100 PGs per TB. That I would vaguely understand: keep the average PG size constant at a maximum of about 10G.

To be fair, this number could just be something vaguely related to
"spin drives have 100-200 IOPS". CentOS/RHEL kernels 10 years ago had
some issues getting IO done towards a single device with as much
parallelism as possible, so running multiple OSDs on flash devices
would have been both a way to get around this limitation in the IO
middle layer and a way to "tell" Ceph it can send more IO to the
device, since it has multiple OSDs on it.

--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



