Delivery Status Notification (Failure)

Ernesto Puerta <epuertat@xxxxxxxxxx> · Thu, 11 Apr 2019 12:39:56 +0200

Hi Cephers,

With the Nautilus release, Ceph-dashboard can now display Prometheus
alerts as pop-up notifications (aka toasts).
(http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting)

As we did for Grafana, there was a proposal to provide a reference
alert rule file (https://tracker.ceph.com/issues/24977), but no action
has been taken for a while, so I'd like to revive this discussion.

With the help of Jan and Boris, I've got the following sources:

- DeepSea: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
- Ceph-Ansible:
https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
- Ceph-mixins (Jsonnet): https://github.com/ceph/ceph-mixins/tree/master/alerts

The three of them share some common alerts (health, OSDs or Mons down,
network drops/errors), but each one brings its own (and IMHO worthy)
alerts:

- DeepSea: alerts based on predicted filling rates (both for pool and disk).
- Ceph-Ansible: slow OSD response, host loss check.
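
To make the "predicted filling rate" idea concrete, here is a minimal Python sketch of the underlying arithmetic: fit a straight line to recent usage samples and extrapolate it over a time horizon. This is only an illustration and not the actual DeepSea rule; the function names, the sample format (timestamp in seconds, used fraction between 0 and 1) and the thresholds are all assumptions of mine. In a real rule file this would more likely be expressed with PromQL's predict_linear(), which performs a similar linear regression.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, used fraction 0..1), assumed format


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Least-squares fit of usage against time; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict_usage(samples: Sequence[Sample], horizon_s: float) -> float:
    """Extrapolate usage to horizon_s seconds after the newest sample."""
    slope, intercept = fit_line(samples)
    last_t = max(t for t, _ in samples)
    return slope * (last_t + horizon_s) + intercept


def will_fill(samples: Sequence[Sample], horizon_s: float,
              capacity: float = 1.0) -> bool:
    """True if the extrapolated usage reaches capacity within the horizon."""
    return predict_usage(samples, horizon_s) >= capacity


if __name__ == "__main__":
    # Hypothetical pool growing by 5 percentage points per hour.
    history = [(0.0, 0.50), (3600.0, 0.55), (7200.0, 0.60)]
    print(will_fill(history, horizon_s=3600))           # one hour ahead
    print(will_fill(history, horizon_s=5 * 24 * 3600))  # five days ahead
```

The appeal of this kind of alert is that it fires on the trend instead of a fixed "near full" threshold, so a fast-growing pool can be flagged while it is still far from full.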

This is the complete list:

* common:
- Ceph Health: Warning/Error
- OSDs Down
- Mon down/quorum
- Disks/OSDs near full
- Network errors/drops
- PG checks: inactive/unclean/stuck
- PG count: high pg count deviation/OSD(s) with High PG Count
- Pool capacity

* deepsea:
- 10% OSDs down
- flap osd
- root volume full
- storage filling up
- pool filling up

* ceph-ansible:
- OSD Host(s) Down
- OSD Host Loss Check
- Slow OSD Responses
- Cluster Capacity Low

My specific questions:
- Anyone else currently undertaking this?
- Any objections to having this reference set in the ceph repo
(monitoring/prometheus), or any alternative proposals?
- Any relevant alert missing there?

Any other feedback will also be appreciated!

Kind Regards,
Ernesto