Do you have two Prometheus instances? Could you share the output of:
ceph orch ls prometheus --export
Or alternatively:
ceph orch ps --daemon-type prometheus
You can use two instances for HA, but then you need to change the
threshold for both, of course.
Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
Thanks, Eugen! Just in case you have any more suggestions, this
still isn’t quite working for us.
Perhaps one clue is that in the Alerts view of the cephadm
dashboard, every alert is listed twice. We see two CephPGImbalance
alerts, both set to 30% after redeploying the service. If I then
follow your procedure, one of the alerts updates to 50% as
configured, but the other stays at 30. Is it normal to see each
alert listed twice, or did I somehow make a mess of things when
trying to change the default alerts?
No problem if it’s not an obvious answer, we can live with and
ignore the spurious CephPGImbalance alerts.
Thanks again,
Devin
On Jan 7, 2025, at 2:14 AM, Eugen Block <eblock@xxxxxx> wrote:
Hi,
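sure thing, here's the diff showing how I changed it to a 50% deviation
instead of 30% (lines marked "-" are from my modified file, lines
marked "+" are from the original .dist copy):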
---snip---
diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
--- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml 2024-12-17 10:03:23.540179209 +0100
+++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist 2024-12-17 10:03:00.380883413 +0100
@@ -237,13 +237,13 @@
 type: "ceph_default"
 - alert: "CephPGImbalance"
 annotations:
- description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
+ description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
 summary: "PGs are not balanced across OSDs"
 expr: |
 abs(
 ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
 on (job) group_left avg(ceph_osd_numpg > 0) by (job)
- ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
+ ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
---snip---
Then you restart prometheus ('ceph orch ps --daemon-type
prometheus' shows you the exact daemon name):
ceph orch daemon restart prometheus.host1
This will only work until you upgrade or redeploy prometheus, of
course, because the rules file is then written again with the defaults.
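
The manual edit above could be scripted roughly as follows. This is only a sketch, not something from the thread: the function name set_pg_imbalance_threshold is made up, it assumes the description and the threshold each sit on a single line in ceph_alerts.yml, and it only handles two-digit percentages (10 to 99). It keeps the untouched file as ceph_alerts.yml.dist, mirroring the naming in the diff.

```shell
# Sketch (hypothetical helper): change the CephPGImbalance threshold in one
# rules file. Usage: set_pg_imbalance_threshold FILE OLD_PCT NEW_PCT
# OLD_PCT and NEW_PCT are two-digit integers, e.g. 30 and 50.
set_pg_imbalance_threshold() {
  file="$1"; old="$2"; new="$3"

  # Keep the original once as FILE.dist; never overwrite an existing copy.
  [ -e "$file.dist" ] || cp -p "$file" "$file.dist"

  # Rewrite both the human-readable description and the PromQL threshold.
  sed -e "s/deviates by more than ${old}% from average PG count/deviates by more than ${new}% from average PG count/" \
      -e "s/ceph_osd_metadata > 0\.${old}/ceph_osd_metadata > 0.${new}/" \
      "$file" > "$file.tmp" || return 1

  # Write back in place so ownership and permissions of FILE are preserved.
  cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Example call, using the path and daemon name from this mail:
#   set_pg_imbalance_threshold \
#     /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml 30 50
#   ceph orch daemon restart prometheus.host1
```

With more than one Prometheus daemon, the same call would have to be repeated on each host for that daemon's own ceph_alerts.yml, followed by a restart of that daemon.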
Regards,
Eugen
Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
Thanks, Eugen. I’m afraid I haven’t yet found a way to either
disable the CephPGImbalance alert or change it to handle different
OSD sizes. Changing
/var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t
seem to have any effect, and I haven’t even managed to change the
behavior from within the running prometheus container.
If you have a functioning workaround, can you give a little more
detail on exactly what yaml file you’re changing and where?
Thanks again,
Devin
On Dec 30, 2024, at 12:39 PM, Eugen Block <eblock@xxxxxx> wrote:
Funny, I wanted to look into how to deal with different OSD sizes
next week, or whether somebody already has a fix for that. My
workaround is changing the yaml file for Prometheus as well.
Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
Hi, All. We are using cephadm to manage a 19.2.0 cluster on
fully-updated AlmaLinux 9 hosts, and would greatly appreciate
help modifying or overriding the alert rules in
ceph_default_alerts.yml. Is the best option to simply update
the /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
In particular, we’d like to either disable the CephPGImbalance
alert or change it to calculate averages per-pool or
per-crush_rule instead of globally as in [1].
We currently have PG autoscaling enabled, and have two separate
crush_rules (one with large spinning disks, one with much
smaller nvme drives). Although I don’t believe it causes any
technical issues with our configuration, our dashboard is full
of CephPGImbalance alerts that would be nice to clean up without
having to create periodic silences.
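
To illustrate why a single global average produces these alerts with mixed OSD sizes, here is a small sketch of the same arithmetic the rule performs (relative deviation from the average PG count). The function name pg_imbalance and the PG counts are invented for illustration; they are not taken from our cluster.

```shell
# Sketch (hypothetical helper): read "osd_name pg_count" pairs on stdin and
# print every OSD whose PG count deviates from the average of all listed
# OSDs by more than the given fraction (0.30 mirrors the default rule).
pg_imbalance() {
  threshold="$1"
  awk -v t="$threshold" '
    $2 > 0 { n++; name[n] = $1; pgs[n] = $2; sum += $2 }
    END {
      if (n == 0) exit
      avg = sum / n
      for (i = 1; i <= n; i++) {
        dev = (pgs[i] - avg) / avg
        if (dev < 0) dev = -dev          # absolute relative deviation
        if (dev > t) printf "%s %.2f\n", name[i], dev
      }
    }'
}

# Made-up numbers: two large spinning-disk OSDs and two small NVMe OSDs.
printf '%s\n' 'osd.0 200' 'osd.1 200' 'osd.2 40' 'osd.3 40' | pg_imbalance 0.30
```

With these made-up counts the average is 120, so all four OSDs deviate by 0.67 and are flagged, even though each group is perfectly balanced internally. Feeding only one group at a time (for example just osd.0 and osd.1) prints nothing, which is the effect a per-pool or per-crush_rule average would have.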
Any help or suggestions would be greatly appreciated.
Many thanks,
Devin
[1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx