Re: Prometheus Alerts for Ceph: Reference Rules

Jan Fajerski <jan.fajerski@xxxxxxxx> · Thu, 11 Apr 2019 13:02:01 +0200

On Thu, Apr 11, 2019 at 12:51:34PM +0200, Boris Ranto wrote:
Hey all,

if we want to have a set of rules in the official Ceph repo then I
would recommend merging all the alerting rules and letting the
products decide which alerts they want to mute.
+1

My reasoning here is that afaik, you can configure AlertManager (even
remotely) to silence certain alerts but you can't configure Prometheus
remotely to include new alerting rules.
One solution is to set the Prometheus config option rule_files to an
appropriate file glob (say /etc/prometheus/rules/*). Then a user can simply drop
a rules file in that location and voilà.
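
A minimal sketch of what that could look like in prometheus.yml, assuming the
directory from the example above (the path is only a suggestion, not an
established convention):

```yaml
# prometheus.yml (fragment)
# Assumption: rules live under /etc/prometheus/rules/, as in the example above.
rule_files:
  # Every file matching this glob is loaded as a rules file.
  - /etc/prometheus/rules/*
```

Prometheus reads rule files at startup and on a configuration reload, so a
newly dropped file takes effect after a reload (for example, by sending SIGHUP
to the Prometheus process).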

Regards,
Boris

On Thu, Apr 11, 2019 at 12:20 PM Ernesto Puerta <epuertat@xxxxxxxxxx> wrote:

Hi Cephers,

With the Nautilus release, Ceph Dashboard can now display Prometheus alerts as pop-up notifications (aka toasts). [1]

As we did for Grafana, there was a proposal to provide a reference alert rule file, [2] but no action has been taken for a while, so I'd like to revive this discussion.

With the help of Jan and Boris, I've got the following sources:

- DeepSea [3]
- Ceph-Ansible [4]
- Ceph-mixins (Jsonnet) [5]

All three share some common alerts (health, OSDs or Mons down, network drops/errors), but each one brings its own (and IMHO worthwhile) alerts:

- DeepSea: alerts based on predicted filling rates (both for pool and disk).
- Ceph-Ansible: slow OSD response, host loss check.

This is the complete list:

[common]
- Ceph Health: Warning/Error
- OSDs Down
- Mon down/quorum
- Disks/OSDs near full
- Network errors/drops
- PG checks: inactive/unclean/stuck
- PG count: high pg count deviation/OSD(s) with High PG Count
- Pool capacity

[deepsea]
- 10% OSDs down
- flap osd
- root volume full
- storage filling up
- pool filling up

[ceph-ansible]
- OSD Host(s) Down
- OSD Host Loss Check
- Slow OSD Responses
- Cluster Capacity Low
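
For illustration only, two of the common alerts could be written roughly as
below in the Prometheus 2.x rule file format. This sketch is not taken from
any of the three sources. The alert names, durations and severity labels are
hypothetical, and the metric names (ceph_health_status, ceph_osd_up) are
assumed to be those exported by the ceph-mgr prometheus module.

```yaml
# Hypothetical reference rules file, e.g. dropped into a rules directory.
groups:
  - name: ceph-reference-example
    rules:
      # Ceph Health: Error
      # Assumption: ceph_health_status is 0 for OK, 1 for WARN, 2 for ERR.
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Ceph has been in HEALTH_ERR for more than 5 minutes.

      # OSDs Down
      # Assumption: ceph_osd_up is 1 while an OSD is up and 0 otherwise.
      - alert: CephOSDDown
        expr: count(ceph_osd_up == 0) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          description: One or more OSDs have been down for more than 15 minutes.
```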

My specific questions:
- Anyone else currently undertaking this?
- Any objections to having this reference set in the ceph repo (monitoring/prometheus), or alternative proposals?
- Any relevant alert missing there?

Any other feedback will also be appreciated!

Kind Regards,
Ernesto

[1]: http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting
[2]: https://tracker.ceph.com/issues/24977
[3]: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
[4]: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
[5]: https://github.com/ceph/ceph-mixins/tree/master/alerts

--
Jan Fajerski
Engineer Enterprise Storage
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)