Just going into production now with a large-ish multisite radosgw setup on 10.2. We are starting off by alerting on anything that isn't HEALTH_OK, just to see how things go. If we get HEALTH_WARN but no mons or OSDs are down, it will be a low-level alert. We will massage the scripts to pick up on different conditions.
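As a rough illustration of the rule described above, here is a minimal Python sketch. The function name `classify_alert`, the severity labels "low" and "high", and the idea of passing in counts of down mons and OSDs are all assumptions made for the example, not our actual scripts. Treating every other non-OK state (including HEALTH_ERR) as "high" is also an assumption; how the health string and the counts are collected from the cluster is left out.

```python
def classify_alert(health, mons_down, osds_down):
    """Map a cluster health summary to a hypothetical alert severity.

    health     -- overall health string, e.g. "HEALTH_OK", "HEALTH_WARN"
    mons_down  -- number of monitors currently down (assumed to be known)
    osds_down  -- number of OSDs currently down (assumed to be known)
    Returns None when no alert is needed, otherwise "low" or "high".
    """
    # Nothing to report when the cluster is healthy.
    if health == "HEALTH_OK":
        return None
    # A warning with all mons and OSDs up is only a low-level alert.
    if health == "HEALTH_WARN" and mons_down == 0 and osds_down == 0:
        return "low"
    # Assumption: anything else that is not HEALTH_OK is a high-level alert.
    return "high"


if __name__ == "__main__":
    # Example inputs; real values would come from the cluster status.
    for sample in [("HEALTH_OK", 0, 0), ("HEALTH_WARN", 0, 0), ("HEALTH_WARN", 0, 2)]:
        print(sample, "->", classify_alert(*sample))
```

Keeping the decision in one small function makes it easy to add more conditions later without touching whatever collects the status.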
We're using graphite via collectd for visualization.
-- Trey
On Fri, Jan 13, 2017 at 3:15 PM, Chris Jones <cjones@xxxxxxxxxxx> wrote:
General question/survey:

Those that have larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? Not really talking about collectd related but more on initial alerts of an issue or potential issue? What threshold do you use, basically? Just trying to get a pulse of what others are doing.

Thanks in advance.

--
Best Regards,
Chris Jones
Bloomberg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com