Hi Cephers,

With the Nautilus release, the Ceph Dashboard can now display Prometheus alerts as pop-up notifications (aka toasts).
(http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting)

As we did for Grafana, there was a proposal to provide a reference alert rule file (https://tracker.ceph.com/issues/24977), but no action has been taken on it for a while, so I'd like to revive the discussion.

With the help of Jan and Boris, I've gathered the following sources:
- DeepSea: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
- Ceph-Ansible: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
- Ceph-mixins (Jsonnet): https://github.com/ceph/ceph-mixins/tree/master/alerts

All three share some common alerts (health, OSDs or Mons down, network drops/errors), but each one also brings its own (and IMHO worthwhile) alerts:
- DeepSea: alerts based on predicted filling rates (both for pools and disks).
- Ceph-Ansible: slow OSD response, host loss check.

This is the complete list:
* common:
  - Ceph Health: Warning/Error
  - OSDs Down
  - Mon down/quorum
  - Disks/OSDs near full
  - Network errors/drops
  - PG checks: inactive/unclean/stuck
  - PG count: high PG count deviation/OSD(s) with high PG count
  - Pool capacity
* deepsea:
  - 10% OSDs down
  - flap osd
  - root volume full
  - storage filling up
  - pool filling up
* ceph-ansible:
  - OSD Host(s) Down
  - OSD Host Loss Check
  - Slow OSD Responses
  - Cluster Capacity Low

My specific questions:
- Is anyone else currently working on this?
- Are there any objections to keeping this reference set in the ceph repo (monitoring/prometheus), or any alternative proposals?
- Is any relevant alert missing from the list?

Any other feedback will also be appreciated!

Kind Regards,
Ernesto