On Fri, 13 Jan 2017 at 22:15, Chris Jones <cjones@xxxxxxxxxxx> wrote:
General question/survey:

Those that have larger clusters, how are you doing alerting/monitoring? Meaning, do you trigger off of 'HEALTH_WARN', etc.? I'm not really talking about collectd-related metrics, but more about the initial alerts of an issue or potential issue. Basically, what thresholds do you use? Just trying to get a pulse of what others are doing.

Thanks in advance.

--
Best Regards,
Chris Jones
Bloomberg

Hi,

We monitor for 'low IOPS'. The threshold differs between our clusters. For example, if we see only 3000 IOPS, something is going wrong.

Another good check is the S3 API. We try to read an object through the S3 API every 30 seconds.

We also have many checks such as: more than 10% of OSDs are down, PGs are inactive, the cluster has degraded capacity, and similar. Some of these checks are not critical, and for those we only get emails.

One more important thing is disk latency monitoring. We've had huge slowdowns on our cluster when the journal SSDs wore out. It's quite hard to understand what's going on in that case, because all OSDs are up and running, but the cluster is not performing at all.

Network errors on interfaces can be important too. We had some issues when a physical cable was malfunctioning and the cluster had many blocks.
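To make the thresholds above concrete, here is a minimal sketch of how such checks could be evaluated against one snapshot of metrics. It is an illustration only, not the tooling described in this thread: the `ClusterSample` fields, the `evaluate` function, and the split between "page" and "email" severities are all assumptions, and collecting the metrics (from Ceph, the S3 probe, and so on) is left out entirely. Only the 3000 IOPS and 10% figures come from the message above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClusterSample:
    """One snapshot of cluster metrics (all field names are hypothetical)."""
    client_iops: float   # current client IOPS across the cluster
    osds_total: int      # number of OSDs in the cluster
    osds_down: int       # number of OSDs currently down
    pgs_inactive: int    # number of inactive placement groups
    s3_probe_ok: bool    # result of the periodic S3 object read


def evaluate(sample: ClusterSample,
             min_iops: float = 3000.0,
             max_down_fraction: float = 0.10) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for every check that fires.

    The severity assignments are assumptions; the thread only says that
    some checks are non-critical and result in emails.
    """
    alerts: List[Tuple[str, str]] = []

    # 'Low IOPS': the threshold is per cluster, 3000 is the example given.
    if sample.client_iops <= min_iops:
        alerts.append(("page", f"low IOPS: {sample.client_iops:.0f}"))

    # The periodic S3 object read failed.
    if not sample.s3_probe_ok:
        alerts.append(("page", "S3 API read probe failed"))

    # More than 10% of OSDs are down (guard against an empty cluster).
    if sample.osds_total > 0:
        down_fraction = sample.osds_down / sample.osds_total
        if down_fraction > max_down_fraction:
            alerts.append(("email", f"{down_fraction:.0%} of OSDs are down"))

    # Any inactive placement group.
    if sample.pgs_inactive > 0:
        alerts.append(("page", f"{sample.pgs_inactive} PGs inactive"))

    return alerts


if __name__ == "__main__":
    sample = ClusterSample(client_iops=2500, osds_total=100, osds_down=11,
                           pgs_inactive=0, s3_probe_ok=True)
    for severity, message in evaluate(sample):
        print(severity, message)
```

Keeping the evaluation separate from metric collection means the same thresholds can be tuned per cluster by passing different `min_iops` and `max_down_fraction` values.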
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com