Re: Modify or override ceph_default_alerts.yml

"Devin A. Bougie" <devin.bougie@xxxxxxxxxxx> · Wed, 22 Jan 2025 15:34:10 +0000

I just wanted to follow up and explain how we ended up with each alert listed twice, which also prevented our changes to ceph_alerts.yml from taking effect.

We only had one prometheus service running, and only one PGImbalance rule in the /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml file.

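*However*, before modifying it I had first backed up the original file to /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml.bk, in the same alerting directory. Prometheus apparently loaded that backup as a second rules file, so every rule was listed twice and the unmodified copy kept firing at the original threshold.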

Once I removed the ceph_alerts.yml.bk file, the dashboard only showed one alert rule as it should (modified for a deviation of 90%) and all of the “30%” active alerts cleared.
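
A quick way to spot this kind of duplication is to check which files in the alerting directory define a given alert. The following is a minimal shell sketch and not something from this thread: the function name files_defining_alert is made up for illustration, and it assumes Prometheus loads every file in that directory, which is consistent with the behavior described above.

```shell
# List every file in a directory that defines the given alert name.
# More than one line of output means the alert is defined more than once.
# Assumption: Prometheus reads all files in the alerting directory.
files_defining_alert() {
    alert_dir="$1"
    alert_name="$2"
    # -l prints only the names of matching files, -s hides "no such file" errors,
    # -E enables the optional-quote pattern around the alert name.
    grep -lsE -- "alert:[[:space:]]*\"?${alert_name}\"?[[:space:]]*\$" "${alert_dir}"/* | sort
}

# Example (placeholders as in the paths above):
# files_defining_alert /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting CephPGImbalance
```

If this prints both ceph_alerts.yml and a leftover copy such as ceph_alerts.yml.bk, the leftover copy is the likely source of the second, unmodified rule.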

So for now, at least until we figure out how to override a given alert using templates, Eugen’s procedure works fine.
1.  Modify (but don’t back up or rename within the same directory) /var/lib/ceph/{FSID}/prometheus.{host}/etc/prometheus/alerting/ceph_alerts.yml
2.  Restart prometheus
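
If a copy of the original rules is still wanted, one option is to keep it outside the alerting directory so that Prometheus never sees it. This is a minimal sketch under that assumption, not part of Eugen's procedure: the function name backup_rules_file, the ".orig" suffix, and the backup location are all hypothetical.

```shell
# Copy a rules file to a backup directory that is NOT the alerting directory.
# Refuses to run if both directories resolve to the same path.
backup_rules_file() {
    rules_file="$1"   # e.g. .../etc/prometheus/alerting/ceph_alerts.yml
    backup_dir="$2"   # hypothetical location, e.g. /root/ceph-alert-backups
    mkdir -p -- "$backup_dir" || return 1
    rules_dir=$(cd "$(dirname "$rules_file")" && pwd -P) || return 1
    target_dir=$(cd "$backup_dir" && pwd -P) || return 1
    if [ "$rules_dir" = "$target_dir" ]; then
        echo "refusing to back up into the alerting directory" >&2
        return 1
    fi
    # -p preserves mode and timestamps of the original file.
    cp -p -- "$rules_file" "$target_dir/$(basename "$rules_file").orig"
}

# Afterwards, edit the rules file and restart the daemon; the exact daemon
# name is shown by 'ceph orch ps --daemon-type prometheus':
# ceph orch daemon restart prometheus.{host}
```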

Many thanks to Eugen for their help tracking this down!

Sincerely,
Devin

> On Jan 13, 2025, at 9:55 PM, Devin A. Bougie <devin.bougie@xxxxxxxxxxx> wrote:
>
> Hi Eugen,
>
> No, as far as I can tell I only have one prometheus service running.
>
> ———
> [root@cephman2 ~]# ceph orch ls prometheus --export
> service_type: prometheus
> service_name: prometheus
> placement:
>   count: 1
>   label: _admin
>
> [root@cephman2 ~]# ceph orch ps --daemon-type prometheus
> NAME                 HOST                         PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
> prometheus.cephman2  cephman2.classe.cornell.edu  *:9095  running (12h)     4m ago   3w     350M        -  2.43.0   a07b618ecd1d  5a8d88682c28
> ———
>
> Anything else I can check or do?
>
> Thanks,
> Devin
>
>> On Jan 13, 2025, at 6:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>>
>> Do you have two Prometheus instances? Maybe you could share
>> ceph orch ls prometheus --export
>>
>> Or alternatively:
>> ceph orch ps --daemon-type prometheus
>>
>> You can use two instances for HA, but then you need to change the threshold for both, of course.
>>
>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>
>>> Thanks, Eugen!  Just in case you have any more suggestions, this still isn’t quite working for us.
>>>
>>> Perhaps one clue is that in the Alerts view of the cephadm dashboard, every alert is listed twice.  We see two CephPGImbalance alerts, both set to 30% after redeploying the service.  If I then follow your procedure, one of the alerts updates to 50% as configured, but the other stays at 30%.  Is it normal to see each alert listed twice, or did I somehow make a mess of things when trying to change the default alerts?
>>>
>>> No problem if it’s not an obvious answer, we can live with and ignore the spurious CephPGImbalance alerts.
>>>
>>> Thanks again,
>>> Devin
>>>
>>>> On Jan 7, 2025, at 2:14 AM, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Sure thing, here's the diff showing how I changed it to a 50% deviation instead of 30%:
>>>>
>>>> ---snip---
>>>> diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
>>>> --- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml    2024-12-17 10:03:23.540179209 +0100
>>>> +++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist       2024-12-17 10:03:00.380883413 +0100
>>>> @@ -237,13 +237,13 @@
>>>>          type: "ceph_default"
>>>>      - alert: "CephPGImbalance"
>>>>        annotations:
>>>> -          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
>>>> +          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
>>>>          summary: "PGs are not balanced across OSDs"
>>>>        expr: |
>>>>          abs(
>>>>            ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
>>>>            on (job) group_left avg(ceph_osd_numpg > 0) by (job)
>>>> -          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
>>>> +          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
>>>> ---snip---
>>>>
>>>> Then you restart prometheus ('ceph orch ps --daemon-type prometheus' shows you the exact daemon name):
>>>>
>>>> ceph orch daemon restart prometheus.host1
>>>>
>>>> This will only work until you upgrade prometheus, of course.
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>>
>>>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>>>
>>>>> Thanks, Eugen.  I’m afraid I haven’t yet found a way to either disable the CephPGImbalance alert or change it to handle different OSD sizes.  Changing /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem to have any effect, and I haven’t even managed to change the behavior from within the running prometheus container.
>>>>>
>>>>> If you have a functioning workaround, can you give a little more detail on exactly what yaml file you’re changing and where?
>>>>>
>>>>> Thanks again,
>>>>> Devin
>>>>>
>>>>>> On Dec 30, 2024, at 12:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>>>>>>
>>>>>> Funny, I wanted to take a look next week at how to deal with different OSD sizes, or whether somebody already has a fix for that. My workaround is changing the yaml file for Prometheus as well.
>>>>>>
>>>>>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>>>>>
>>>>>>> Hi, All.  We are using cephadm to manage a 19.2.0 cluster on fully-updated AlmaLinux 9 hosts, and would greatly appreciate help modifying or overriding the alert rules in ceph_default_alerts.yml.  Is the best option to simply update the /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
>>>>>>>
>>>>>>> In particular, we’d like to either disable the CephPGImbalance alert or change it to calculate averages per-pool or per-crush_rule instead of globally as in [1].
>>>>>>>
>>>>>>> We currently have PG autoscaling enabled, and have two separate crush_rules (one with large spinning disks, one with much smaller nvme drives).  Although I don’t believe it causes any technical issues with our configuration, our dashboard is full of CephPGImbalance alerts that would be nice to clean up without having to create periodic silences.
>>>>>>>
>>>>>>> Any help or suggestions would be greatly appreciated.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> Devin
>>>>>>>
>>>>>>> [1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
