I just wanted to follow up to explain how we ended up with each alert being listed twice, which also prevented our changes to ceph_alerts.yml from taking effect.

We only had one prometheus service running, and only one CephPGImbalance rule in the /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml file. *However*, before modifying it I had first backed up the original file to /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml.bk, in the same alerting directory, so Prometheus was evidently loading the rules from both files.

Once I removed the ceph_alerts.yml.bk file, the dashboard only showed one alert rule as it should (modified for a deviation of 90%) and all of the “30%” active alerts cleared.

So for now, at least until we figure out how to override a given alert using templates, Eugen’s procedure works fine.

1. Modify /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml (but don’t leave a backup or renamed copy in the same directory)
2. Restart prometheus

Many thanks to Eugen for their help tracking this down!

Sincerely,
Devin

> On Jan 13, 2025, at 9:55 PM, Devin A. Bougie <devin.bougie@xxxxxxxxxxx> wrote:
>
> Hi Eugen,
>
> No, as far as I can tell I only have one prometheus service running.
>
> ---
> [root@cephman2 ~]# ceph orch ls prometheus --export
> service_type: prometheus
> service_name: prometheus
> placement:
>   count: 1
>   label: _admin
>
> [root@cephman2 ~]# ceph orch ps --daemon-type prometheus
> NAME                 HOST                         PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
> prometheus.cephman2  cephman2.classe.cornell.edu  *:9095  running (12h)  4m ago     3w   350M     -        2.43.0   a07b618ecd1d  5a8d88682c28
> ---
>
> Anything else I can check or do?
>
> Thanks,
> Devin
>
>> On Jan 13, 2025, at 6:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>>
>> Do you have two Prometheus instances? Maybe you could share
>> ceph orch ls prometheus --export
>>
>> Or alternatively:
>> ceph orch ps --daemon-type prometheus
>>
>> You can use two instances for HA, but then you need to change the threshold for both, of course.
>>
>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>
>>> Thanks, Eugen! Just in case you have any more suggestions, this still isn’t quite working for us.
>>>
>>> Perhaps one clue is that in the Alerts view of the cephadm dashboard, every alert is listed twice. We see two CephPGImbalance alerts, both set to 30% after redeploying the service. If I then follow your procedure, one of the alerts updates to 50% as configured, but the other stays at 30%. Is it normal to see each alert listed twice, or did I somehow make a mess of things when trying to change the default alerts?
>>>
>>> No problem if it’s not an obvious answer, we can live with and ignore the spurious CephPGImbalance alerts.
>>>
>>> Thanks again,
>>> Devin
>>>
>>>> On Jan 7, 2025, at 2:14 AM, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> sure thing, here's the diff showing how I changed it to 50% deviation instead of 30%:
>>>>
>>>> ---snip---
>>>> diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
>>>> --- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml 2024-12-17 10:03:23.540179209 +0100
>>>> +++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist 2024-12-17 10:03:00.380883413 +0100
>>>> @@ -237,13 +237,13 @@
>>>>            type: "ceph_default"
>>>>        - alert: "CephPGImbalance"
>>>>          annotations:
>>>> -          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
>>>> +          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
>>>>            summary: "PGs are not balanced across OSDs"
>>>>          expr: |
>>>>            abs(
>>>>              ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
>>>>              on (job) group_left avg(ceph_osd_numpg > 0) by (job)
>>>> -          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
>>>> +          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
>>>> ---snip---
>>>>
>>>> Then you restart prometheus ('ceph orch ps --daemon-type prometheus' shows you the exact daemon name):
>>>>
>>>> ceph orch daemon restart prometheus.host1
>>>>
>>>> This will only work until you upgrade prometheus, of course.
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>>
>>>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>>>
>>>>> Thanks, Eugen. I’m afraid I haven’t yet found a way to either disable the CephPGImbalance alert or change it to handle different OSD sizes. Changing /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem to have any effect, and I haven’t even managed to change the behavior from within the running prometheus container.
>>>>>
>>>>> If you have a functioning workaround, can you give a little more detail on exactly which yaml file you’re changing and where?
>>>>>
>>>>> Thanks again,
>>>>> Devin
>>>>>
>>>>>> On Dec 30, 2024, at 12:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>>>>>>
>>>>>> Funny, I wanted to take a look next week at how to deal with different OSD sizes, or whether somebody already has a fix for that. My workaround is changing the yaml file for Prometheus as well.
>>>>>>
>>>>>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>>>>>
>>>>>>> Hi, All. We are using cephadm to manage a 19.2.0 cluster on fully-updated AlmaLinux 9 hosts, and would greatly appreciate help modifying or overriding the alert rules in ceph_default_alerts.yml. Is the best option to simply update the /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
>>>>>>>
>>>>>>> In particular, we’d like to either disable the CephPGImbalance alert or change it to calculate averages per-pool or per-crush_rule instead of globally as in [1].
>>>>>>>
>>>>>>> We currently have PG autoscaling enabled, and have two separate crush_rules (one with large spinning disks, one with much smaller nvme drives). Although I don’t believe it causes any technical issues with our configuration, our dashboard is full of CephPGImbalance alerts that would be nice to clean up without having to create periodic silences.
>>>>>>>
>>>>>>> Any help or suggestions would be greatly appreciated.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Devin
>>>>>>>
>>>>>>> [1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
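
A note on the duplicate alerts described at the top of this thread: the symptoms are consistent with Prometheus loading every file in the alerting directory, backup copies included. That is an assumption here, since the thread does not show the generated prometheus.yml. Under that assumption, a minimal Python sketch for spotting the problem before restarting prometheus might look like the following. The function name find_duplicate_alerts and the regex-based line matching are illustrative only and are not part of Ceph or Prometheus; a real YAML parser would be more robust.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches rule entries such as:   - alert: "CephPGImbalance"
# (simple line matching, assumed sufficient for the ceph_alerts.yml layout
# shown in the diff earlier in the thread)
ALERT_LINE = re.compile(
    r'^\s*-\s*alert:\s*["\']?([A-Za-z_][A-Za-z0-9_]*)["\']?\s*$'
)


def find_duplicate_alerts(alerting_dir):
    """Return {alert name: [file names]} for alerts defined in more than one file.

    Every regular file in the directory is scanned, regardless of extension,
    because a stray copy such as ceph_alerts.yml.bk is exactly the case of
    interest.
    """
    files_by_alert = defaultdict(set)
    for path in sorted(Path(alerting_dir).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text(errors="replace").splitlines():
            match = ALERT_LINE.match(line)
            if match:
                files_by_alert[match.group(1)].add(path.name)
    return {
        name: sorted(files)
        for name, files in files_by_alert.items()
        if len(files) > 1
    }
```

Called on /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting (with the FSID and host filled in), an empty result means each alert name is defined in only one file. In the situation described above, every alert would be expected to map to both ceph_alerts.yml and ceph_alerts.yml.bk.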