Re: FBR: Add monitoring for website build fails

On Thu, 27 Feb 2020 at 06:53, Rick Elrod <codeblock@xxxxxxxx> wrote:
I'd like to apply the following which does:
- Adds a script I wrote that reads a timestamp from a file on disk
and alerts if that timestamp is NOT within a given delta of now.
- Applies this to sundries01 and uses it to check
/srv/websites/getfedora.org/build.timestamp.txt which now gets
generated as part of the websites build.
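
For illustration only, here is a minimal sketch of how a build step might produce such a timestamp file. The function name write_build_timestamp is hypothetical and not part of the patch, and the sketch assumes Python 3; the actual websites build may generate the file differently. It writes to a temporary file and renames it so that the check never reads a half-written value.

```python
import os
import tempfile
import time


def write_build_timestamp(path):
    """Atomically write the current epoch time to `path` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('%.2f\n' % time.time())
        # mkstemp creates the file with mode 0600; relax it so the check can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The file then contains a single float, which is the format check_timestamp_from_file expects.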

The purpose is to catch the case where someone commits something to the
websites repo which breaks the build. Because of how we have things set
up in openshift (cronjob), we don't get any kind of alert when that
happens.

I think it would be better to find a way to monitor the cronjob in OpenShift, since that would be useful for other projects too.
Did you investigate that idea?
 

Right now this sets the delta to 3 hours. In theory it should be 1,
but I figure let it try to build a few times before we start alerting.
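
As a rough sketch of the arithmetic behind that threshold: assuming the build runs hourly (implied by "in theory it should be 1") and three missed runs are tolerated, the delta comes out to 10800 seconds, which matches the value in the nrpe config in the patch. The names BUILD_INTERVAL, ALLOWED_MISSES and check_state are illustrative only and do not appear in the patch.

```python
BUILD_INTERVAL = 3600   # seconds; assumes the websites cronjob runs hourly
ALLOWED_MISSES = 3      # let the build try a few times before alerting
DELTA = BUILD_INTERVAL * ALLOWED_MISSES  # 10800 seconds = 3 hours


def check_state(timestamp, now, delta=DELTA):
    """Return the Nagios exit code the check would produce: 0 (OK) or 2 (CRITICAL)."""
    age = now - timestamp
    if age > delta:
        return 2
    return 0
```

With these assumptions, a build that last succeeded two hours ago is still OK, and one that last succeeded four hours ago is CRITICAL.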

+1 but I would prefer a way to have notification on a failed cronjob :-)
 

Rick


commit 657d050f6d699bc43973d968cd93d12131fca7f2
Author: Rick Elrod <relrod@xxxxxxxxxx>
Date:   Thu Feb 27 05:29:24 2020 +0000

    nagios: Add script and check for checking that a timestamp within a file is within a delta of now, and then use this for alerting when websites stop building

    Signed-off-by: Rick Elrod <relrod@xxxxxxxxxx>

diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file
new file mode 100644
index 0000000..9064337
--- /dev/null
+++ b/roles/nagios_client/files/scripts/check_timestamp_from_file
@@ -0,0 +1,43 @@
+#!/usr/bin/env python
+
+# Takes a path to a file and a delta. The file must simply contain an epoch
+# timestamp. It can be an integer or a float, as can the delta.
+#
+# Alerts critical if (now - timestamp contained in file) > delta.
+#
+# Rick Elrod <relrod@xxxxxxxxxx>
+# MIT
+
+import sys
+import time
+
+if len(sys.argv) != 3:
+    print('UNKNOWN: Pass path to file and delta as parameters')
+    sys.exit(3)
+
+filename = sys.argv[1]
+delta = float(sys.argv[2])
+
+timestamp = None
+
+try:
+    with open(filename, 'r') as f:
+        timestamp = float(f.read().strip())
+except Exception as e:
+    print('UNKNOWN: Unable to open/read file path')
+    sys.exit(3)
+
+difference = round(time.time() - timestamp, 2)
+if difference > delta:
+    print(
+        'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % (
+            timestamp,
+            delta,
+            difference - delta))
+    sys.exit(2)
+
+print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % (
+    timestamp,
+    delta,
+    abs(difference - delta)))
+sys.exit(0)
diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml
index 2e5e0df..8e71a3b 100644
--- a/roles/nagios_client/tasks/main.yml
+++ b/roles/nagios_client/tasks/main.yml
@@ -47,6 +47,7 @@
   - check_osbs_api.py
   - check_ipa_replication
   - check_redis_queue.sh
+  - check_timestamp_from_file
   when: not inventory_hostname.startswith('noc')
   tags:
   - nagios_client
@@ -226,6 +227,16 @@
   tags:
   - nagios_client

+- name: install nrpe checks for sundries/websites
+  template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644
+  with_items:
+  - check_websites_buildtime.cfg
+  when: inventory_hostname.startswith('sundries')
+  notify:
+  - restart nrpe
+  tags:
+  - nagios_client
+
 - name: install nrpe config for the RabbitMQ checks
   template:
     src: "rabbitmq_args.ini.j2"
diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
new file mode 100644
index 0000000..ff5639d
--- /dev/null
+++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
@@ -0,0 +1,2 @@
+# Alert if websites haven't been built in 3 hours
+command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800
diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
index 85e8f8e..c8958d7 100644
--- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2
+++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
@@ -316,4 +316,14 @@ define service {
   use                   ppc-secondarytemplate
 }

+## Auxiliary to websites but necessary to make them happen
+
+define service {
+  host_name             sundries01.phx2.fedoraproject.org
+  service_description   websites build happened recently
+  check_command         check_by_nrpe!check_websites_buildtime
+  use                   websitetemplate
+}
+
+
 {% endif %}
_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx