Hi Frank,
> Is this not checked per OSD? This would be really bad, because if it
> just uses the average (currently 143.3) this warning will never be
> triggered in critical situations.
I believe you're right; I can only remember warnings about the average
PG count per OSD, not the absolute value. I don't know whether this is
being worked on; maybe a tracker issue would be helpful here.
Regards,
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi Eugen,
the PG merge finished and I still observe that no PG warning shows
up. We have
  mgr  advanced  mon_max_pg_per_osd  300
and I have an OSD with 306 PGs. Still, no warning:
# ceph health detail
HEALTH_OK
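For reference, the configured threshold and the actual per-OSD PG
counts can be inspected with the standard ceph CLI (the PGS column of
`ceph osd df tree` holds the per-OSD count):

```shell
# Show the configured warning threshold (as stored for the mgr above):
ceph config get mgr mon_max_pg_per_osd

# Per-OSD PG counts are in the PGS column of the output:
ceph osd df tree
```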
Is this not checked per OSD? This would be really bad, because if it
just uses the average (currently 143.3) this warning will never be
triggered in critical situations. Our average PG count is dominated
by a huge HDD pool with ca. 120 PGs/OSD. We have a number of smaller
SSD pools where we go closer to the limit. The critical pool has 24
OSDs, and I would have to create thousands of PGs on these OSDs
before the average crosses the threshold. In other words, if it's the
average PG count that is used, this warning is almost always too
late, because it's the small pools where too many PGs end up on OSDs,
but these don't influence the average much.
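The effect is easy to sketch with made-up numbers (hypothetical PG
counts, not taken from any real cluster): a dominant pool keeps the
average far below the threshold while individual OSDs are well past it.

```shell
# Hypothetical per-OSD PG counts: many HDD-pool OSDs around 120 PGs,
# plus a few small-pool OSDs that exceed the 300-PG limit.
pgs="118 120 122 119 121 306 310"
limit=300

avg=$(printf '%s\n' $pgs | awk '{s+=$1; n++} END {printf "%.1f", s/n}')
max=$(printf '%s\n' $pgs | sort -n | tail -n1)

echo "average: $avg  (below $limit -> no warning if the average is used)"
echo "max:     $max    (above $limit -> should warn per OSD)"
```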
I was always wondering how users ended up with more than 1000 PGs
per OSD by accident during recovery. It now makes more sense. If
there is no per-OSD warning, this can easily happen.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 12 October 2022 17:11:02
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re: How to force PG merging in one step?
Hi Eugen.
> During recovery there's another factor involved
> (osd_max_pg_per_osd_hard_ratio); the default is 3. I had to deal with
> that a few months back when I got inactive PGs due to many chunks and
> "only" a factor of 3. In that specific cluster I increased it to 5 and
> didn't encounter inactive PGs anymore.
Yes, I looked at this as well and I remember cases where people got
stuck with temporary PG numbers being too high. This is precisely
why I wanted to see this warning. If it's off during recovery, the
only way to notice that something is going wrong is when you hit the
hard limit. But then it's too late.
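For reference (standard `ceph config` commands; the exact value to use
depends on the cluster), the hard-limit factor mentioned above can be
inspected and raised like this:

```shell
# Default is 3: OSDs refuse new PGs past 3 x mon_max_pg_per_osd,
# which is what makes PGs go inactive during recovery.
ceph config get osd osd_max_pg_per_osd_hard_ratio

# Raise it (here to 5) to give recovery more headroom:
ceph config set osd osd_max_pg_per_osd_hard_ratio 5
```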
I actually wanted to see this during recovery to have an early
warning sign. I purposefully did not increase pg_num_max to 500 to
make sure that warning shows up. I personally consider it really bad
behaviour if recovery/rebalancing disables this warning. Recovery is
the operation where exceeding a PG limit without knowing will
hurt most.
Thanks for the heads up. Probably need to watch my * a bit more with
certain things.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx