I'd like to apply the following which does: - Adds a script I wrote for reading a timestamp from a file on disk and alerting if the timestamp within it is NOT within a particular delta to now. - Applies this to sundries01 and uses it to check /srv/websites/getfedora.org/build.timestamp.txt which now gets generated as part of the websites build. The purpose is because sometimes someone will commit something to the websites repo which breaks the build, but because of how we have things set up in openshift (cronjob), we don't get any kind of alert when that happens. Right now this sets the delta to 3 hours. In theory it should be 1, but I figure let it try to build a few times before we start alerting. Rick commit 657d050f6d699bc43973d968cd93d12131fca7f2 Author: Rick Elrod <relrod@xxxxxxxxxx> Date: Thu Feb 27 05:29:24 2020 +0000 nagios: Add script and check for checking that a timestamp within a file is within a delta of now, and then use this for alerting when websites stop building Signed-off-by: Rick Elrod <relrod@xxxxxxxxxx> diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file new file mode 100644 index 0000000..9064337 --- /dev/null +++ b/roles/nagios_client/files/scripts/check_timestamp_from_file @@ -0,0 +1,43 @@ +#!/usr/bin/env python + +# Takes a path to a file and a delta. The file must simply contain an epoch +# timestamp. It can be an integer or a float, as can the delta. +# +# Alerts critical if (now - timestamp contained in file) > delta. +# +# Rick Elrod <relrod@xxxxxxxxxx> +# MIT + +import sys +import time + +if len(sys.argv) != 3: + print('UNKNOWN: Pass path to file and delta as parameters') + sys.exit(3) + +filename = sys.argv[1] +delta = float(sys.argv[2]) + +timestamp = None + +try: + with open(filename, 'r') as f: + timestamp = float(f.read().strip()) +except Exception as e: + print('UNKNOWN: Unable to open/read file path') + sys.exit(3) + +difference = round(time.time() - timestamp, 2) +if difference > delta: + print( + 'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % ( + timestamp, + delta, + difference - delta)) + sys.exit(2) + +print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % ( + timestamp, + delta, + abs(difference - delta))) +sys.exit(0) diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml index 2e5e0df..8e71a3b 100644 --- a/roles/nagios_client/tasks/main.yml +++ b/roles/nagios_client/tasks/main.yml @@ -47,6 +47,7 @@ - check_osbs_api.py - check_ipa_replication - check_redis_queue.sh + - check_timestamp_from_file when: not inventory_hostname.startswith('noc') tags: - nagios_client @@ -226,6 +227,16 @@ tags: - nagios_client +- name: install nrpe checks for sundries/websites + template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644 + with_items: + - check_websites_buildtime.cfg + when: inventory_hostname.startswith('sundries') + notify: + - restart nrpe + tags: + - nagios_client + - name: install nrpe config for the RabbitMQ checks template: src: "rabbitmq_args.ini.j2" diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 new file mode 100644 index 0000000..ff5639d --- /dev/null +++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 @@ -0,0 +1,2 @@ +# Alert if websites haven't been built in 3 hours +command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800 diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2 index 85e8f8e..c8958d7 100644 --- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 +++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2 @@ -316,4 +316,14 @@ define service { use ppc-secondarytemplate } +## Auxillary to websites but necessary to make them happen + +define service { + host_name sundries01.phx2.fedoraproject.org + service_description websites build happened recently + check_command check_by_nrpe!check_websites_buildtime + use websitetemplate +} + + {% endif %} _______________________________________________ infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx