Hi,
sure thing, here's the diff showing how I changed it to a 50% deviation instead of 30%:
---snip---
diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
--- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml      2024-12-17 10:03:23.540179209 +0100
+++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist 2024-12-17 10:03:00.380883413 +0100
@@ -237,13 +237,13 @@
type: "ceph_default"
- alert: "CephPGImbalance"
annotations:
- description: "OSD {{ $labels.ceph_daemon }} on {{
$labels.hostname }} deviates by more than 50% from average PG count."
+ description: "OSD {{ $labels.ceph_daemon }} on {{
$labels.hostname }} deviates by more than 30% from average PG count."
summary: "PGs are not balanced across OSDs"
expr: |
abs(
((ceph_osd_numpg > 0) - on (job) group_left
avg(ceph_osd_numpg > 0) by (job)) /
on (job) group_left avg(ceph_osd_numpg > 0) by (job)
- ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
+ ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
---snip---
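By the way, raising the threshold only hides the symptom if the real
problem is OSDs of different sizes (as discussed below). An alternative
would be to compute the average per device class instead of globally.
This is completely untested, but since ceph_osd_metadata carries a
device_class label, something along these lines might work as the expr:

  abs(
    (
      (ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class, hostname) ceph_osd_metadata
      - on (job, device_class) group_left
        avg by (job, device_class) ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
    )
    / on (job, device_class) group_left
      avg by (job, device_class) ((ceph_osd_numpg > 0) * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
  ) > 0.30

That way each OSD is only compared to the average PG count of its own
device class (hdd vs. nvme); the description annotation would need the
same adjustment.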
Either way, you then restart prometheus ('ceph orch ps --daemon-type
prometheus' shows you the exact daemon name):
ceph orch daemon restart prometheus.host1
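If prometheus then refuses to start, promtool helps to spot syntax
errors in the edited file. I haven't verified the exact paths, but the
upstream prometheus image ships promtool and cephadm bind-mounts the
rules under /etc/prometheus, so something like this should work:

  cephadm enter --name prometheus.host1
  promtool check rules /etc/prometheus/alerting/ceph_alerts.yml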
This will only work until you upgrade prometheus, of course.
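If I remember the docs correctly, cephadm also has a config-key
mechanism for shipping custom alert rules that survives redeployments,
roughly:

  ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml -i custom_alerts.yml
  ceph orch redeploy prometheus

I haven't tried that for this case, though, and I'm not sure it can
override the stock CephPGImbalance rule rather than just add new
alerts, so treat it as a pointer to the docs rather than a tested
recipe.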
Regards,
Eugen
Zitat von "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
Thanks, Eugen. I’m afraid I haven’t yet found a way to either
disable the CephPGImbalance alert or change it to handle different
OSD sizes. Changing
/var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem
to have any effect, and I haven’t even managed to change the
behavior from within the running prometheus container.
If you have a functioning workaround, can you give a little more
detail on exactly what yaml file you’re changing and where?
Thanks again,
Devin
On Dec 30, 2024, at 12:39 PM, Eugen Block <eblock@xxxxxx> wrote:
Funny, I was planning to look into how to deal with different OSD
sizes next week, or to see whether somebody already has a fix for
that. My workaround is changing the YAML file for Prometheus as well.
Zitat von "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
Hi, all. We are using cephadm to manage a 19.2.0 cluster on fully
updated AlmaLinux 9 hosts, and would greatly appreciate help
modifying or overriding the alert rules in
ceph_default_alerts.yml. Is the best option to simply update the
/var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
In particular, we’d like to either disable the CephPGImbalance
alert or change it to calculate averages per-pool or
per-crush_rule instead of globally as in [1].
We currently have PG autoscaling enabled, and have two separate
crush_rules (one with large spinning disks, one with much smaller
nvme drives). Although I don’t believe it causes any technical
issues with our configuration, our dashboard is full of
CephPGImbalance alerts that would be nice to clean up without
having to create periodic silences.
Any help or suggestions would be greatly appreciated.
Many thanks,
Devin
[1]
https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx