Re: FBR: Add monitoring for website build fails

On Thu, 27 Feb 2020 at 12:03, Rick Elrod <codeblock@xxxxxxxx> wrote:
On Thu, Feb 27, 2020 at 4:31 AM Clement Verna <cverna@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On Thu, 27 Feb 2020 at 06:53, Rick Elrod <codeblock@xxxxxxxx> wrote:
>>
>> I'd like to apply the following, which:
>> - Adds a script I wrote that reads a timestamp from a file on disk
>> and alerts if that timestamp is NOT within a particular delta of
>> now.
>> - Applies this to sundries01 and uses it to check
>> /srv/websites/getfedora.org/build.timestamp.txt which now gets
>> generated as part of the websites build.
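For illustration, the build side could write that file roughly as follows. This is a minimal sketch, not the actual websites build step: the helper name `write_build_timestamp` and the write-to-temp-then-rename approach are assumptions; only the file's contents (a bare epoch timestamp) come from the description above.

```python
import os
import tempfile
import time


def write_build_timestamp(path):
    """Write the current epoch time to `path`, replacing it atomically.

    Hypothetical helper: the real build may generate the file differently.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the target so a concurrent reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".build.timestamp.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("%f\n" % time.time())
        # mkstemp creates the file with mode 0600; widen it so that a
        # separate monitoring user is able to read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling this only as the last step of a successful build means a broken build leaves the old timestamp in place, which is what lets the check notice that builds have stopped.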
>>
>> The reason is that sometimes someone commits something to the
>> websites repo that breaks the build, but because of how we have
>> things set up in OpenShift (cronjob), we don't get any kind of alert
>> when that happens.
>
>
> I think it would be better to find a way to monitor the cronjob in OpenShift, since that would be useful for other projects too.
> Did you investigate that idea?
>
>>
>>
>> Right now this sets the delta to 3 hours. In theory it should be 1 hour,
>> but I figure we should let it try to build a few times before we start
>> alerting.
>
>
> +1, but I would prefer a way to get notified of a failed cronjob :-)

I'd prefer that too (or probably in addition), but I don't know
anything about how to set up that monitoring right now.
It looks like there's an OpenShift API endpoint for monitoring crons:
https://major.io/2019/11/18/monitoring-openshift-cron-jobs/
but we'd need to set up an API key for nagios checks to use somehow.

Yes, I think we would need a "nagios" service account, which should then give us a token to use for authentication.
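
A check along these lines might look like the sketch below. It assumes the service account token is allowed to list jobs through the standard Kubernetes batch API (`/apis/batch/v1/namespaces/<namespace>/jobs`), which OpenShift also serves. The API URL, namespace, and function names are placeholders for illustration, not something that exists in our setup today.

```python
import json
import ssl
import urllib.request

# Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def fetch_jobs(api_url, namespace, token, cafile=None):
    """Return the parsed JobList for a namespace (all arguments are placeholders)."""
    url = "%s/apis/batch/v1/namespaces/%s/jobs" % (api_url.rstrip("/"), namespace)
    request = urllib.request.Request(
        url, headers={"Authorization": "Bearer %s" % token})
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def evaluate_jobs(job_list):
    """Map a JobList dict to a (nagios_exit_code, message) pair.

    Only the most recently started job is considered.
    """
    started = [job for job in job_list.get("items", [])
               if job.get("status", {}).get("startTime")]
    if not started:
        return UNKNOWN, "UNKNOWN: no started jobs found"
    # startTime is an RFC 3339 UTC string, so string order is time order.
    latest = max(started, key=lambda job: job["status"]["startTime"])
    name = latest["metadata"]["name"]
    status = latest["status"]
    if status.get("succeeded", 0) > 0:
        return OK, "OK: latest job %s succeeded" % name
    # "failed" counts failed pods, so a job that is still retrying after
    # a failed pod also lands here.
    if status.get("failed", 0) > 0:
        return CRITICAL, "CRITICAL: latest job %s failed" % name
    return OK, "OK: latest job %s is still running" % name
```

A wrapper script would print the message and pass the code to `sys.exit()`, the same way the timestamp check does. Keeping `evaluate_jobs` separate from the HTTP call makes the decision logic testable without a cluster.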
 
Probably worth looking into, but for the time being I'd still like to
apply this FBR, as we are going to have some Outreachy activity
happening on websites soon and we need to know that the prod build
isn't broken.

-re

>
>>
>>
>> Rick
>>
>>
>> commit 657d050f6d699bc43973d968cd93d12131fca7f2
>> Author: Rick Elrod <relrod@xxxxxxxxxx>
>> Date:   Thu Feb 27 05:29:24 2020 +0000
>>
>>     nagios: Add script and check for checking that a timestamp within a file is within a delta of now, and then use this for alerting when websites stop building
>>
>>     Signed-off-by: Rick Elrod <relrod@xxxxxxxxxx>
>>
>> diff --git a/roles/nagios_client/files/scripts/check_timestamp_from_file b/roles/nagios_client/files/scripts/check_timestamp_from_file
>> new file mode 100644
>> index 0000000..9064337
>> --- /dev/null
>> +++ b/roles/nagios_client/files/scripts/check_timestamp_from_file
>> @@ -0,0 +1,43 @@
>> +#!/usr/bin/env python
>> +
>> +# Takes a path to a file and a delta. The file must simply contain an epoch
>> +# timestamp. It can be an integer or a float, as can the delta.
>> +#
>> +# Alerts critical if (now - timestamp contained in file) > delta.
>> +#
>> +# Rick Elrod <relrod@xxxxxxxxxx>
>> +# MIT
>> +
>> +import sys
>> +import time
>> +
>> +if len(sys.argv) != 3:
>> +    print('UNKNOWN: Pass path to file and delta as parameters')
>> +    sys.exit(3)
>> +
>> +filename = sys.argv[1]
>> +delta = float(sys.argv[2])
>> +
>> +timestamp = None
>> +
>> +try:
>> +    with open(filename, 'r') as f:
>> +        timestamp = float(f.read().strip())
>> +except Exception:
>> +    print('UNKNOWN: Unable to read a timestamp from file path')
>> +    sys.exit(3)
>> +
>> +difference = round(time.time() - timestamp, 2)
>> +if difference > delta:
>> +    print(
>> +        'CRITICAL: Timestamp in file (%.2f) exceeds delta (%.2f) by %.2f seconds' % (
>> +            timestamp,
>> +            delta,
>> +            difference - delta))
>> +    sys.exit(2)
>> +
>> +print('OK: Timestamp in file (%.2f) is within delta (%.2f) of now, by %.2f seconds' % (
>> +    timestamp,
>> +    delta,
>> +    abs(difference - delta)))
>> +sys.exit(0)
>> diff --git a/roles/nagios_client/tasks/main.yml b/roles/nagios_client/tasks/main.yml
>> index 2e5e0df..8e71a3b 100644
>> --- a/roles/nagios_client/tasks/main.yml
>> +++ b/roles/nagios_client/tasks/main.yml
>> @@ -47,6 +47,7 @@
>>    - check_osbs_api.py
>>    - check_ipa_replication
>>    - check_redis_queue.sh
>> +  - check_timestamp_from_file
>>    when: not inventory_hostname.startswith('noc')
>>    tags:
>>    - nagios_client
>> @@ -226,6 +227,16 @@
>>    tags:
>>    - nagios_client
>>
>> +- name: install nrpe checks for sundries/websites
>> +  template: src={{ item }}.j2 dest=/etc/nrpe.d/{{ item }} owner=root group=root mode=0644
>> +  with_items:
>> +  - check_websites_buildtime.cfg
>> +  when: inventory_hostname.startswith('sundries')
>> +  notify:
>> +  - restart nrpe
>> +  tags:
>> +  - nagios_client
>> +
>>  - name: install nrpe config for the RabbitMQ checks
>>    template:
>>      src: "rabbitmq_args.ini.j2"
>> diff --git a/roles/nagios_client/templates/check_websites_buildtime.cfg.j2 b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
>> new file mode 100644
>> index 0000000..ff5639d
>> --- /dev/null
>> +++ b/roles/nagios_client/templates/check_websites_buildtime.cfg.j2
>> @@ -0,0 +1,2 @@
>> +# Alert if websites haven't been built in 3 hours
>> +command[check_websites_buildtime]={{ libdir }}/nagios/plugins/check_timestamp_from_file /srv/websites/getfedora.org/build.timestamp.txt 10800
>> diff --git a/roles/nagios_server/templates/nagios/services/websites.cfg.j2 b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
>> index 85e8f8e..c8958d7 100644
>> --- a/roles/nagios_server/templates/nagios/services/websites.cfg.j2
>> +++ b/roles/nagios_server/templates/nagios/services/websites.cfg.j2
>> @@ -316,4 +316,14 @@ define service {
>>    use                   ppc-secondarytemplate
>>  }
>>
>> +## Auxiliary to websites but necessary to make them happen
>> +
>> +define service {
>> +  host_name             sundries01.phx2.fedoraproject.org
>> +  service_description   websites build happened recently
>> +  check_command         check_by_nrpe!check_websites_buildtime
>> +  use                   websitetemplate
>> +}
>> +
>> +
>>  {% endif %}
>> _______________________________________________
>> infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
>> To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx
>> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives: https://lists.fedoraproject.org/archives/list/infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
