Re: Prometheus Alerts for Ceph: Reference Rules

Hey all,

if we want to have a set of rules in the official Ceph repo, then I
would recommend merging all the alerting rules and letting the
products decide which alerts they want to mute.

My reasoning here is that afaik, you can configure AlertManager (even
remotely) to silence certain alerts but you can't configure Prometheus
remotely to include new alerting rules.
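
To illustrate the AlertManager side, here is a minimal sketch in Python of
how a product could mute one alert remotely. It assumes an AlertManager
that exposes the v2 HTTP API (POST to /api/v2/silences). The alert name,
author, and base URL below are made up for the example and do not come
from any of the rule files discussed here.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, author, comment, now=None):
    """Build a silence payload matching a single alert by its name."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            # Exact (non-regex) match on the alertname label.
            {"name": "alertname", "value": alertname, "isRegex": False}
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def post_silence(base_url, silence):
    """POST the silence to <base_url>/api/v2/silences, return the parsed reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage (alert name and URL are placeholders):
# silence = build_silence("OSDsDown", 2, "boris", "planned maintenance")
# post_silence("http://alertmanager.example:9093", silence)
```

There is no comparable call on the Prometheus side for adding a rule, as
far as I know: the rule files have to be present on the Prometheus host
and the configuration reloaded.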

Regards,
Boris

On Thu, Apr 11, 2019 at 12:20 PM Ernesto Puerta <epuertat@xxxxxxxxxx> wrote:
>
> Hi Cephers,
>
> With the Nautilus release, Ceph-dashboard can now display Prometheus alerts as pop-up notifications (aka toasts). [1]
>
> As we did for Grafana, there was a proposal to provide a reference alert rule file [2], but no action has been taken for a while, so I'd like to revive this discussion.
>
> With the help of Jan and Boris, I've got the following sources:
>
> - DeepSea [3]
> - Ceph-Ansible [4]
> - Ceph-mixins (Jsonnet) [5]
>
> The three of them share some common alerts (health, OSDs or Mons down, network drops/errors), but each one brings its own (and IMHO worthy) alerts:
>
> - DeepSea: alerts based on predicted filling rates (both for pool and disk).
> - Ceph-Ansible: slow OSD responses, host loss check.
>
> This is the complete list:
>
> [common]
> - Ceph Health: Warning/Error
> - OSDs Down
> - Mon down/quorum
> - Disks/OSDs near full
> - Network errors/drops
> - PG checks: inactive/unclean/stuck
> - PG count: high pg count deviation/OSD(s) with High PG Count
> - Pool capacity
>
> [deepsea]
> - 10% OSDs down
> - flap osd
> - root volume full
> - storage filling up
> - pool filling up
>
> [ceph-ansible]
> - OSD Host(s) Down
> - OSD Host Loss Check
> - Slow OSD Responses
> - Cluster Capacity Low
>
> My specific questions:
> - Anyone else currently undertaking this?
> - Any objections to having this reference set in the ceph repo (monitoring/prometheus), or alternative proposals?
> - Any relevant alert missing there?
>
> Any other feedback will also be appreciated!
>
> Kind Regards,
> Ernesto
>
> [1]: http://docs.ceph.com/docs/nautilus/mgr/dashboard/#enabling-prometheus-alerting
> [2]: https://tracker.ceph.com/issues/24977
> [3]: https://github.com/SUSE/DeepSea/blob/SES5/srv/salt/ceph/monitoring/prometheus/files/ses_default_alerts.yml
> [4]: https://github.com/ceph/ceph-ansible/blob/wip-dashboard/roles/ceph-prometheus/files/ceph_dashboard.yml
> [5]: https://github.com/ceph/ceph-mixins/tree/master/alerts


