Re: What is the problem with many PGs per OSD

Yes, this is an old lesson, and AFAIK nobody has intentionally pushed the
bounds in a long time because it was very painful for anybody who ran into
it.

The main problem was RAM use scaling with the PG count, which in normal
operation is often fine but, as we all know, balloons in failure
conditions.
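
As a rough illustration of that scaling (every per-PG and per-log-entry
constant below is a made-up assumption for illustration, not a measured
Ceph figure):

    # Back-of-envelope estimate of OSD memory tied to PG state. All the
    # constants here are illustrative assumptions, not measured Ceph values.
    def osd_pg_memory_gib(pgs_per_osd: int,
                          log_entries_per_pg: int,
                          bytes_per_log_entry: int = 2_000,
                          fixed_bytes_per_pg: int = 1_000_000) -> float:
        """Rough GiB an OSD might spend on per-PG metadata plus PG logs."""
        per_pg = fixed_bytes_per_pg + log_entries_per_pg * bytes_per_log_entry
        return pgs_per_osd * per_pg / 2**30

    # Steady state: modest PG count, short logs.
    print(osd_pg_memory_gib(100, 3_000))    # ~0.65 GiB
    # Failure/backfill: surviving OSDs pick up extra PGs and logs grow,
    # so the same arithmetic balloons.
    print(osd_pg_memory_gib(400, 10_000))   # ~7.8 GiB

The exact constants don't matter; the point is that the cost is linear in
the PG count in the good case and multiplies with it in the bad case, which
is why doubling the RAM was sometimes the only way out.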

There are many developments that may have made things behave better, but
early on some clusters just couldn't be recovered until they received
double their starting RAM and were babysat through a careful,
manually-orchestrated startup. (Or maybe worse; I forget.)

Nobody's run experiments, presumably because the current sizing guidelines
are good enough to get on with for anybody who has the resources to engage
in the measurement work it would take to re-validate them. (A
back-of-envelope sketch of the arithmetic behind the current numbers
follows the quoted messages below.) I will be surprised if anybody has the
kind of information you seem to be searching for.
-Greg

On Thu, Oct 10, 2024 at 12:13 AM Frank Schilder <frans@xxxxxx> wrote:

> Hi Janne.
>
> > To be fair, this number could just be something vaguely related to
> > "spin drives have 100-200 iops" ...
>
> It could be, but is it? Or is it just another rumor? I simply don't see
> how the PG count could possibly impact the I/O load on a disk.
>
> How about this guess: it could have been dragged along from a time when
> HDDs were <=1T and it simply means keeping PGs no larger than about 10G.
> Sounds reasonable, but is it?
>
> I think we should really stop second-guessing here. This discussion was
> not meant to become a long thread where we all just guess but never know.
> I would appreciate it if someone who actually knows why this
> recommendation really exists would chime in here. As far as I can tell, it
> could be anything or nothing. I actually tend toward it being nothing: it
> was just never updated along with new developments, and nowadays nobody
> knows any more.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Janne Johansson <icepic.dz@xxxxxxxxx>
> Sent: Thursday, October 10, 2024 8:51 AM
> To: Frank Schilder
> Cc: Anthony D'Atri; ceph-users@xxxxxxx
> Subject: Re:  Re: What is the problem with many PGs per OSD
>
> On Wed, 9 Oct 2024 at 20:48, Frank Schilder <frans@xxxxxx> wrote:
>
> > The PG count per OSD is a striking exception. It's just a number (well,
> > a range, with 100 recommended and 200 as a max:
> > https://docs.ceph.com/en/latest/rados/operations/pgcalc/#keyDL). It just
> > is. And this doesn't make any sense unless there is something really
> > evil lurking in the dark.
> > For comparison, a guideline that does make sense is something like 100
> > PGs per TB. That I would vaguely understand: to keep the average PG size
> > capped at about 10G.
>
> To be fair, this number could just be something vaguely related to
> "spin drives have 100-200 IOPS". The CentOS/RHEL Linux kernels of 10
> years ago did have some issues getting I/O done in parallel as much as
> possible towards a single device, so running multiple OSDs on flash
> devices would have been both a way to get around this limitation in
> the I/O middle layer and a way to "tell" Ceph that it can send more
> I/O to the device, since it has multiple OSDs on it.
>
> --
> May the most significant bit of your life be positive.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
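
For what it's worth, the arithmetic behind those numbers is easy to sketch.
The per-pool formula below is my recollection of what the pgcalc page does,
and the cluster sizes are invented purely for illustration:

    # Rough sketch of the pgcalc-style arithmetic behind "100-200 PGs per
    # OSD" and the "about 10G per PG" reading of it. Formula recalled from
    # the pgcalc page; all cluster numbers below are made up.
    def pool_pg_num(num_osds: int, target_pgs_per_osd: int = 100,
                    replica_size: int = 3, data_fraction: float = 1.0) -> int:
        """Suggested pg_num for one pool, rounded up to a power of two."""
        raw = num_osds * target_pgs_per_osd * data_fraction / replica_size
        p = 1
        while p < raw:
            p *= 2
        return p

    def pg_replicas_per_osd(pg_num: int, replica_size: int, num_osds: int) -> float:
        # Each PG is stored on `replica_size` OSDs, so count PG replicas.
        return pg_num * replica_size / num_osds

    def avg_pg_size_gib(osd_size_tib: float, pgs_on_osd: float) -> float:
        return osd_size_tib * 1024 / pgs_on_osd

    pgn = pool_pg_num(num_osds=40)            # -> 2048
    print(pg_replicas_per_osd(pgn, 3, 40))    # ~154 PG replicas per OSD
    # 100 PGs on a 1 TiB OSD average ~10 GiB each; the same count on a
    # 16 TiB OSD averages ~164 GiB each, so the "10G per PG" reading only
    # holds for the drive sizes of that era.
    print(avg_pg_size_gib(1, 100), avg_pg_size_gib(16, 100))

Whether any of that arithmetic is the actual origin of the recommendation
is exactly the open question in this thread.
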
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



