
Nagios plugin to check slony replication

I've finally got around to writing the two Nagios plugins that I am using to check our Slony cluster (on our Linux servers). I'm posting them in case anyone else wants to use them, either as they are or as a basis for something else. They are based on Christopher Browne's scripts that ship with Slony.

The two scripts perform different tasks.

check_slon checks that the slon daemon is in the process list and, optionally, checks whether the last entry in the slon log file is a warning or fatal message. It is called with two or three parameters: the clustername, the dbname and (optionally) the location of the log file. This script is to be executed on each node in the cluster (both master and slaves).


check_sloncluster checks that active receiver nodes are confirming syncs within 10 seconds of the master. I'm not entirely sure that this is the best strategy, and if you know of a better one, I'd love to hear it. It requires two parameters: the clustername and the dbname. This script is executed on the master database only.
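
As one possible alternative to a single 10 second cut-off, the sketch below maps a confirmation delay (in whole seconds) to a Nagios exit status using separate warning and critical thresholds. The function name lag_to_status and the default critical threshold of 60 seconds are assumptions made for illustration; neither appears in the scripts posted here, and the warning default of 10 simply mirrors the existing check.

```shell
#!/bin/sh
# Sketch only: map a delay in whole seconds to a Nagios exit status.
# lag_to_status and the 60 second critical default are hypothetical;
# the 10 second warning default mirrors the check_sloncluster query.
lag_to_status() {
  lag=$1
  warn=${2:-10}
  crit=${3:-60}
  if [ "$lag" -ge "$crit" ]
  then
    echo 2    # critical
  elif [ "$lag" -ge "$warn" ]
  then
    echo 1    # warning
  else
    echo 0    # OK
  fi
}
```

A call such as `lag_to_status 30 20 25` treats 30 seconds as critical, because it is at or above the critical threshold of 25.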

These scripts are designed to run on the host on which they are checking. With a little modification, they could check remote servers on the network. They are quite simplistic and may not be suitable for your environment. You are free to modify the code to suit your own needs.
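
A minimal sketch of that modification for the database query, assuming an optional host argument is passed through to psql with its -h option. The function run_status_query, the host argument and the PSQL override variable are hypothetical names introduced here; the posted script hard-codes the psql path and always connects locally. Only check_sloncluster lends itself to this, since check_slon depends on the local process list and a local log file.

```shell
#!/bin/sh
# Sketch only: run a status query against a local or a remote master.
# run_status_query, the host argument and PSQL are hypothetical additions.
run_status_query() {
  sql=$1
  dbname=$2
  dbhost=$3
  if [ -n "$dbhost" ]
  then
    # -h makes psql connect to the named host instead of the local server
    "${PSQL:-psql}" -h "$dbhost" -c "$sql" --tuples-only -U postgres "$dbname"
  else
    "${PSQL:-psql}" -c "$sql" --tuples-only -U postgres "$dbname"
  fi
}
```

The remote server would also need to accept the connection (pg_hba.conf and any password handling), which this sketch does not cover.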

John Sidney-Woollett

check_slon
==========

#!/bin/sh

# nagios plugin that checks whether the slon daemon is running
# if the 3rd parameter (LOGFILE) is specified then the log file is
# checked to see if the last entry is a WARN or FATAL message
#
# three possible exit statuses:
#  0 = OK
#  1 = Warning (warning in slon log file)
#  2 = Fatal Error (slon not running, or error in log file)
#
# script requires two or three parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of database being replicated
# LOGFILE - (optional) location of the slon log file
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid
if [ $# -lt 2 ] || [ $# -gt 3 ]
then
  echo "Invalid parameters need CLUSTERNAME DBNAME [LOGFILE]"
  exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2
LOGFILE=$3

# check to see whether the slon daemon is running
SLONPROCESS=`ps -auxww | egrep "[s]lon $CLUSTERNAME" | egrep "dbname=$DBNAME" | awk '{print $2}'`


if [ ! -n "$SLONPROCESS" ]
then
  echo "no slon process active"
  exit 2
fi

# if the logfile is specified, check it exists
# and check for the word FATAL or WARN at the start of the last line
if [ -n "$LOGFILE" ]
then
  # check for log file
  if [ -f "$LOGFILE" ]
  then
    LOGLINE=`tail -n 1 "$LOGFILE"`
    LOGSTATUS=`echo "$LOGLINE" | awk '{print $1}'`
    # quote the variable so an empty log line does not break the test
    if [ "$LOGSTATUS" = "FATAL" ]
    then
      echo "$LOGLINE"
      exit 2
    elif [ "$LOGSTATUS" = "WARN" ]
    then
      echo "$LOGLINE"
      exit 1
    fi
  else
    echo "$LOGFILE not found"
    exit 2
  fi
fi

# otherwise all looks to be OK
echo "OK - slon process $SLONPROCESS"
exit 0
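
Because check_slon looks only at the last log line, a FATAL entry followed by any other message goes unreported. The sketch below is one hedged variation: it scans the last few lines and prints the worst status found. The function name worst_log_status and the default window of 20 lines are assumptions for illustration, and it keeps the script's own assumption that the level is the first field on each line.

```shell
#!/bin/sh
# Sketch only: print the worst Nagios status found in the last N log lines.
# worst_log_status and the 20 line default are hypothetical; as in
# check_slon, the level is assumed to be the first field of each line.
worst_log_status() {
  logfile=$1
  lines=${2:-20}
  tail -n "$lines" "$logfile" | awk '
    $1 == "FATAL"             { worst = 2 }
    $1 == "WARN" && worst < 1 { worst = 1 }
    END                       { print worst + 0 }'
}
```

The printed value (0, 1 or 2) matches the exit statuses listed at the top of check_slon, so it could be passed straight to exit.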



check_sloncluster
=================

#!/bin/sh

# nagios plugin that checks whether the slave nodes in a slony cluster
# are being updated from the master
#
# possible exit statuses:
#  0 = OK
#  2 = Error, one or more slave nodes are not sync'ing with the master
#
# script requires two parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of master database
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid
if [ $# -ne 2 ]
then
  echo "Invalid parameters need CLUSTERNAME DBNAME"
  exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2

# setup the query to check the replication status
SQL="select case
  when ttlcount = okcount then 'OK - '||okcount||' nodes in sync'
  else 'ERROR - '||ttlcount-okcount||' of '||ttlcount||' nodes not in sync'
end as syncstatus
from (
-- determine total active receivers
select (select count(distinct sub_receiver)
    from _$CLUSTERNAME.sl_subscribe
    where sub_active = true) as ttlcount,
(
-- determine active nodes syncing within 10 seconds
 select count(*) from (
  select st_received, st_last_received_ts - st_last_event_ts as cfmdelay
  from _$CLUSTERNAME.sl_status
  where st_received in (
    select distinct sub_receiver
    from _$CLUSTERNAME.sl_subscribe
    where sub_active = true
  )
) as t1
where cfmdelay < interval '10 secs') as okcount
) as t2"

# query the master database
CHECK=`/usr/local/pgsql/bin/psql -c "$SQL" --tuples-only -U postgres "$DBNAME"`


if [ ! -n "$CHECK" ]
then
  echo "ERROR querying $DBNAME"
  exit 2
fi

# echo the result of the query
# ($CHECK is left unquoted so the shell trims psql's leading whitespace)
echo $CHECK

# and check the return status
STATUS=`echo $CHECK | awk '{print $1}'`
if [ "$STATUS" = "OK" ]
then
  exit 0
else
  exit 2
fi

