Hi Redo,
I've been looking into the templates and have a question. Maybe you
could help clarify. I understand that I can create custom alerts and
inject them with:
ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml -i custom_alerts.yml
That works fine when I want additional alerts.
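For example, a custom_alerts.yml could look like this (a sketch: the
expression is copied from the shipped CephPGImbalance rule with only
the threshold lowered to 3%; the group name is arbitrary, and the
"for"/labels values are just what I'd pick, adjust as needed):
---snip---
groups:
  - name: "custom.rules"
    rules:
      - alert: "CephPGImbalance"
        annotations:
          description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 3% from average PG count."
          summary: "PGs are not balanced across OSDs"
        expr: |
          abs(
            ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
            on (job) group_left avg(ceph_osd_numpg > 0) by (job)
          ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.03
        for: "5m"
        labels:
          severity: "warning"
          type: "ceph_default"
---snip---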
But this way I cannot override an original alert (let's stick with
"CephPGImbalance" as the example). I can create my own alert as
described above (I don't even have to rename it), say with a 3%
deviation in a test cluster, but it shows up in *addition* to the
original 30% alert. And although this command works as well (trying to
override the defaults):
ceph config-key set mgr/cephadm/services/prometheus/alerting/ceph_alerts.yml -i ceph_alerts.yml
the default 30% threshold is not overridden. So the question is: how do
I actually change the original alert, other than via the workaround we
already discussed in this thread? Or am I misunderstanding something?
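For what it's worth, the key itself does get stored; it can be read
back with, e.g.:
ceph config-key get mgr/cephadm/services/prometheus/alerting/ceph_alerts.yml
It just doesn't seem to be used for the default rules.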
Thanks!
Eugen
Quoting Redouane Kachach <rkachach@xxxxxxxxxx>:
> Just FYI: cephadm does support providing/using a custom template (see
> the docs at [1]). For example, you can override the prometheus
> template with the following command:
>
>> ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml <value>
>
> After changing the template, you have to reconfigure the service to
> redeploy the daemons with your new config:
>
>> ceph orch reconfig prometheus
>
> Then you can look into the corresponding directory
> under /var/lib/ceph/<your-fsid>/<your-daemon>/... to check whether
> the container has picked up the new config.
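>
> For example (<your-fsid> and <host> are placeholders; the daemon
> directory name matches the daemon name shown by 'ceph orch ps'):
>
>> cat /var/lib/ceph/<your-fsid>/prometheus.<host>/etc/prometheus/prometheus.yml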
>
>
> *Note:* In general, most of the templates contain variables that are
> used to generate the configuration files dynamically, so be careful
> when changing a template. I'd recommend using the current one as a
> base (the docs explain where to find the templates) and then
> modifying it to add your custom config without altering the dynamic
> parts of the template.
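>
> For example, to list which of these template keys are currently set
> in your cluster (plain config-key CLI, nothing cephadm-specific):
>
>> ceph config-key ls | grep mgr/cephadm/services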
>
> [1] https://docs.ceph.com/en/reef/cephadm/services/monitoring/#option-names
>
> Best,
> Redo.
>
>
> On Tue, Jan 14, 2025 at 8:45 AM Eugen Block <eblock@xxxxxx> wrote:
>
>> Ah, I checked on a newer test cluster (Squid) and now I see what you
>> mean. The alert is shown per OSD in the dashboard; if you open the
>> dropdown you see which daemons are affected. I think it worked a bit
>> differently in Pacific (that's what the customer is still running)
>> when I last had to modify this. How many OSDs do you have? I noticed that
>> it takes a few seconds for prometheus to clear the warning with only 3
>> OSDs in my lab cluster. Maybe you could share a screenshot (with
>> redacted sensitive data) showing the alerts? And the status of the
>> affected OSDs as well.
>>
>>
>> Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>>
>> > Hi Eugen,
>> >
>> > No, as far as I can tell I only have one prometheus service running.
>> >
>> > ———
>> >
>> > [root@cephman2 ~]# ceph orch ls prometheus --export
>> > service_type: prometheus
>> > service_name: prometheus
>> > placement:
>> >   count: 1
>> >   label: _admin
>> >
>> > [root@cephman2 ~]# ceph orch ps --daemon-type prometheus
>> > NAME                 HOST                         PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
>> > prometheus.cephman2  cephman2.classe.cornell.edu  *:9095  running (12h)  4m ago     3w   350M     -        2.43.0   a07b618ecd1d  5a8d88682c28
>> >
>> > ———
>> >
>> > Anything else I can check or do?
>> >
>> > Thanks,
>> > Devin
>> >
>> > On Jan 13, 2025, at 6:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>> >
>> > Do you have two Prometheus instances? Maybe you could share
>> > ceph orch ls prometheus --export
>> >
>> > Or alternatively:
>> > ceph orch ps --daemon-type prometheus
>> >
>> > You can use two instances for HA, but then you need to change the
>> > threshold for both, of course.
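>> >
>> > For illustration, a two-instance spec would look something like
>> > this (count is the relevant part, the placement details are up to
>> > you):
>> >
>> > service_type: prometheus
>> > placement:
>> >   count: 2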
>> >
>> > Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>> >
>> > Thanks, Eugen! Just in case you have any more suggestions: this
>> > still isn’t quite working for us.
>> >
>> > Perhaps one clue is that in the Alerts view of the cephadm
>> > dashboard, every alert is listed twice. We see two CephPGImbalance
>> > alerts, both set to 30% after redeploying the service. If I then
>> > follow your procedure, one of the alerts updates to 50% as
>> > configured, but the other stays at 30%. Is it normal to see each
>> > alert listed twice, or did I somehow make a mess of things when
>> > trying to change the default alerts?
>> >
>> > No problem if there’s no obvious answer; we can live with and
>> > ignore the spurious CephPGImbalance alerts.
>> >
>> > Thanks again,
>> > Devin
>> >
>> > On Jan 7, 2025, at 2:14 AM, Eugen Block <eblock@xxxxxx> wrote:
>> >
>> > Hi,
>> >
>> > sure thing, here's the diff showing how I changed it to 50%
>> > deviation instead of 30%:
>> >
>> > ---snip---
>> > diff -u /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist
>> > --- /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml      2024-12-17 10:03:23.540179209 +0100
>> > +++ /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml.dist 2024-12-17 10:03:00.380883413 +0100
>> > @@ -237,13 +237,13 @@
>> >          type: "ceph_default"
>> >      - alert: "CephPGImbalance"
>> >        annotations:
>> > -        description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 50% from average PG count."
>> > +        description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
>> >          summary: "PGs are not balanced across OSDs"
>> >        expr: |
>> >          abs(
>> >            ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
>> >            on (job) group_left avg(ceph_osd_numpg > 0) by (job)
>> > -        ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.50
>> > +        ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
>> > ---snip---
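>> >
>> > If you have promtool available (it ships with prometheus), you can
>> > sanity-check the edited file before restarting, e.g.:
>> >
>> > promtool check rules /var/lib/ceph/{FSID}/prometheus.host1/etc/prometheus/alerting/ceph_alerts.yml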
>> >
>> > Then you restart prometheus ('ceph orch ps --daemon-type prometheus'
>> > shows you the exact daemon name):
>> >
>> > ceph orch daemon restart prometheus.host1
>> >
>> > This will only work until you upgrade prometheus, of course.
>> >
>> > Regards,
>> > Eugen
>> >
>> >
>> > Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>> >
>> > Thanks, Eugen. I’m afraid I haven’t yet found a way to either
>> > disable the CephPGImbalance alert or change it to handle different
>> > OSD sizes. Changing
>> > /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml doesn’t seem
>> > to have any effect, and I haven’t even managed to change the
>> > behavior from within the running prometheus container.
>> >
>> > If you have a functioning workaround, can you give a little more
>> > detail on exactly what yaml file you’re changing and where?
>> >
>> > Thanks again,
>> > Devin
>> >
>> > On Dec 30, 2024, at 12:39 PM, Eugen Block <eblock@xxxxxx> wrote:
>> >
>> > Funny, I was planning to look next week into how to deal with
>> > different OSD sizes, or whether somebody already has a fix for
>> > that. My workaround is changing the yaml file for Prometheus as well.
>> >
>> > Quoting "Devin A. Bougie" <devin.bougie@xxxxxxxxxxx>:
>> >
>> > Hi, All. We are using cephadm to manage a 19.2.0 cluster on
>> > fully-updated AlmaLinux 9 hosts, and would greatly appreciate help
>> > modifying or overriding the alert rules in ceph_default_alerts.yml.
>> > Is the best option to simply update the
>> > /var/lib/ceph/<cluster_id>/home/ceph_default_alerts.yml file?
>> >
>> > In particular, we’d like to either disable the CephPGImbalance alert
>> > or change it to calculate averages per-pool or per-crush_rule
>> > instead of globally as in [1].
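>> >
>> > Since ceph_osd_metadata exposes a device_class label (but, as far
>> > as I can tell, no crush rule), a per-device-class average might be
>> > the closest approximation for us. A rough, untested sketch of such
>> > an expression (threshold kept at 30%):
>> >
>> > expr: |
>> >   abs(
>> >     (ceph_osd_numpg * on (ceph_daemon) group_left(device_class, hostname) ceph_osd_metadata)
>> >     - on (device_class) group_left
>> >       avg by (device_class) (ceph_osd_numpg * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
>> >   )
>> >   / on (device_class) group_left
>> >     avg by (device_class) (ceph_osd_numpg * on (ceph_daemon) group_left(device_class) ceph_osd_metadata)
>> >   > 0.30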
>> >
>> > We currently have PG autoscaling enabled, and have two separate
>> > crush_rules (one with large spinning disks, one with much smaller
>> > nvme drives). Although I don’t believe it causes any technical
>> > issues with our configuration, our dashboard is full of
>> > CephPGImbalance alerts that would be nice to clean up without having
>> > to create periodic silences.
>> >
>> > Any help or suggestions would be greatly appreciated.
>> >
>> > Many thanks,
>> > Devin
>> >
>> > [1] https://github.com/rook/rook/discussions/13126#discussioncomment-10043490
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>