Re: Prometheus Alerts for Ceph: Reference Rules

I do set up Prometheus to use a file glob, but by "remotely" here I was
talking about a (REST) API for doing that. The file-glob approach
requires direct access to the host in order to add a set of alerting
rules.
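For reference, the remote capability the thread contrasts this with is Alertmanager's silence API (POST /api/v2/silences). A minimal Python sketch of building such a silence payload; the alert name "CephHealthWarning", the localhost URL, and the createdBy value are illustrative assumptions, not taken from any of the rule sets discussed here:

```python
# Sketch: building the JSON payload Alertmanager's v2 API expects
# for a new silence. Alert name, URL and createdBy are assumptions.
import json
from datetime import datetime, timedelta, timezone

def build_silence(alertname, duration_hours=2, comment="muted by product policy"):
    """Return a silence payload for POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": "ceph-dashboard",
        "comment": comment,
    }

if __name__ == "__main__":
    payload = build_silence("CephHealthWarning")
    print(json.dumps(payload, indent=2))
    # To actually create the silence (assumes Alertmanager on localhost):
    # requests.post("http://localhost:9093/api/v2/silences", json=payload)
```

This is exactly the asymmetry Boris points out below: silences can be pushed to Alertmanager over HTTP, whereas Prometheus rule files have to land on the host's filesystem.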

On Thu, Apr 11, 2019 at 1:02 PM Jan Fajerski <jan.fajerski@xxxxxxxx> wrote:
>
> On Thu, Apr 11, 2019 at 12:51:34PM +0200, Boris Ranto wrote:
> >Hey all,
> >
> >if we want to have a set of rules in the official Ceph repo, then I
> >would recommend merging all the alerting rules and letting the
> >products decide which alerts they want to mute.
> +1
> >
> >My reasoning here is that afaik, you can configure AlertManager (even
> >remotely) to silence certain alerts but you can't configure Prometheus
> >remotely to include new alerting rules.
> One solution is to configure the prometheus config variable rule_files with an
> appropriate file glob (say /etc/prometheus/rules/*). Then a user can simply drop
> a rules file in that location and voila.
> >
> >Regards,
> >Boris
> >
> >On Thu, Apr 11, 2019 at 12:20 PM Ernesto Puerta <epuertat@xxxxxxxxxx> wrote:
> >>
> >> Hi Cephers,
> >>
> >> With the Nautilus release, Ceph-dashboard can now display Prometheus alerts as pop-up notifications (aka toasts). [1]
> >>
> >> As we did for Grafana, there was a proposal to provide a reference alert rule file, [2] but no action has been taken for a while, so I'd like to refresh this discussion.
> >>
> >> With the help of Jan and Boris, I've got the following sources:
> >>
> >> - DeepSea [3]
> >> - Ceph-Ansible [4]
> >> - Ceph-mixins (Jsonnet) [5]
> >>
> >> The three of them share some common alerts (health, OSDs or Mons down, network drops/errors), but each one brings its own (and IMHO worthwhile) alerts:
> >>
> >> - DeepSea: alerts based on predicted filling rates (both for pool and disk).
> >> - Ceph-Ansible: slow OSD responses, host loss check.
> >>
> >> This is the complete list:
> >>
> >> [common]
> >> - Ceph Health: Warning/Error
> >> - OSDs Down
> >> - Mon down/quorum
> >> - Disks/OSDs near full
> >> - Network errors/drops
> >> - PG checks: inactive/unclean/stuck
> >> - PG count: high pg count deviation/OSD(s) with High PG Count
> >> - Pool capacity
> >>
> >> [deepsea]
> >> - 10% OSDs down
> >> - flap osd
> >> - root volume full
> >> - storage filling up
> >> - pool filling up
> >>
> >> [ceph-ansible]
> >> - OSD Host(s) Down
> >> - OSD Host Loss Check
> >> - Slow OSD Responses
> >> - Cluster Capacity Low
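For concreteness, one of the common alerts above might look like the following as a Prometheus alerting rule. This is a sketch only: the metric name ceph_osd_up is the one exported by the mgr prometheus module, but the group name, threshold, duration, and labels are illustrative assumptions rather than values taken from any of the three sources:

```yaml
# Sketch of an "OSDs Down" rule; threshold and labels are illustrative.
groups:
  - name: ceph.common
    rules:
      - alert: OSDsDown
        expr: count(ceph_osd_up == 0) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $value }} OSD(s) have been down for 5 minutes."
```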
> >>
> >> My specific questions:
> >> - Anyone else currently undertaking this?
> >> - Any objections to having this reference set in the Ceph repo (monitoring/prometheus), or alternative proposals?
> >> - Any relevant alert missing there?
> >>
> >> Any other feedback will also be appreciated!
> >>
> >> Kind Regards,
> >> Ernesto
> >>
> >> [1]: http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting
> >> [2]: https://tracker.ceph.com/issues/24977
> >> [3]: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
> >> [4]: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
> >> [5]: https://github.com/ceph/ceph-mixins/tree/master/alerts
> >
>
> --
> Jan Fajerski
> Engineer Enterprise Storage
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
> HRB 21284 (AG Nürnberg)


