Nagios plugin to check slony replication

John Sidney-Woollett <johnsw@xxxxxxxxxxxxx> · Sun, 27 Feb 2005 15:58:05 +0000

I've finally got around to writing the two Nagios plugins which I am 
using to check our Slony cluster (on our Linux servers). I'm posting 
them in case anyone else wants them or wants to use them as a basis for 
something else. These are based on Christopher Browne's scripts that 
ship with Slony.

The two scripts perform different tasks.

check_slon checks that the slon daemon is in the process list and 
optionally checks whether the last line of the slon log file is a FATAL 
or WARN message.

It is called with two or three parameters: the clustername, the dbname 
and (optionally) the location of the log file. This script is to be 
executed on each node in the cluster (both master and slaves).
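
As a rough illustration of the log check, here is a minimal sketch of how 
the first word of a log line maps to a Nagios status. The function name 
classify_log_line and the sample log lines are made up for this example; 
they are not part of check_slon.

```shell
#!/bin/sh
# Sketch only: classify_log_line is a hypothetical helper, not part of check_slon.
# It prints the Nagios status for one log line, keyed on the first word of
# the line (the same field check_slon looks at).
classify_log_line() {
  level=$(printf '%s\n' "$1" | awk '{print $1}')
  case "$level" in
    FATAL) echo 2 ;;   # critical
    WARN)  echo 1 ;;   # warning
    *)     echo 0 ;;   # anything else is treated as OK
  esac
}

# made-up sample lines, for illustration only
classify_log_line "FATAL sample fatal message"   # prints 2
classify_log_line "WARN sample warning message"  # prints 1
classify_log_line "DEBUG1 sample debug message"  # prints 0
```

Only the last line of the log is examined, so an older FATAL or WARN 
entry followed by a newer ordinary entry is reported as OK.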

check_sloncluster checks that active receiver nodes are confirming syncs 
within 10 seconds of the master. I'm not entirely sure that this is the 
best strategy, and if you know otherwise, I'd love to hear. It requires 
two parameters: the clustername and the dbname. This script is executed 
on the master database only.
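
The query returns a single line starting with either OK or ERROR, and the 
plugin's exit status follows from that first word. A minimal sketch of 
that mapping, assuming a hypothetical helper named sync_exit_code (the 
script itself does this inline):

```shell
#!/bin/sh
# Sketch only: sync_exit_code is a hypothetical helper, not part of
# check_sloncluster. It prints the Nagios exit code for the one-line
# result of the sync query.
sync_exit_code() {
  # take the first whitespace-separated word; psql --tuples-only
  # pads its output with a leading space, which awk skips
  status=$(printf '%s\n' "$1" | awk '{print $1}')
  if [ "$status" = "OK" ]; then
    echo 0
  else
    # empty output or anything other than OK counts as an error
    echo 2
  fi
}

sync_exit_code " OK - 2 nodes in sync"              # prints 0
sync_exit_code " ERROR - 1 of 2 nodes not in sync"  # prints 2
```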

These scripts are designed to run on the host on which they are 
checking. With a little modification, they could check remote servers on 
the network. They are quite simplistic and may not be suitable for your 
environment. You are free to modify the code to suit your own needs.
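
For the remote case, check_sloncluster only needs a database connection, 
so one possible modification is to pass psql's -h option. The sketch 
below assumes a hypothetical helper psql_args and a placeholder host 
name; it only builds the argument list. check_slon inspects the local 
process list with ps, so it would need a different approach to run 
against another machine.

```shell
#!/bin/sh
# Sketch only: psql_args is a hypothetical helper and db1.example.com is a
# placeholder host. Prints the psql arguments for a local or remote check.
psql_args() {
  dbname=$1
  host=${2:-}   # optional second argument: remote host
  if [ -n "$host" ]; then
    echo "-h $host -U postgres --tuples-only $dbname"
  else
    echo "-U postgres --tuples-only $dbname"
  fi
}

psql_args mydb                  # prints: -U postgres --tuples-only mydb
psql_args mydb db1.example.com  # prints: -h db1.example.com -U postgres --tuples-only mydb
```

The result would replace the fixed arguments on the psql command line in 
check_sloncluster, with the host supplied as an extra script parameter.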

John Sidney-Woollett

check_slon
==========

#!/bin/sh

# nagios plugin that checks whether the slon daemon is running
# if the 3rd parameter (LOGFILE) is specified then the log file is
# checked to see if the last entry is a WARN or FATAL message
#
# three possible exit statuses:
#  0 = OK
#  1 = Warning (warning in slon log file)
#  2 = Fatal Error (slon not running, or error in log file)
#
# script requires two or three parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of database being replicated
# LOGFILE - (optional) location of the slon log file
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid
if [ $# -lt 2 ] || [ $# -gt 3 ]
then
  echo "Invalid parameters need CLUSTERNAME DBNAME [LOGFILE]"
  exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2
LOGFILE=$3

# check to see whether the slon daemon is running

SLONPROCESS=`ps -auxww | egrep "[s]lon $CLUSTERNAME" | egrep "dbname=$DBNAME" | awk '{print $2}'`

if [ ! -n "$SLONPROCESS" ]
then
  echo "no slon process active"
  exit 2
fi

# if the logfile is specified, check it exists
# and check for the word FATAL or WARN at the start of the last line
if [ -n "$LOGFILE" ]
then
  # check for log file
  if [ -f "$LOGFILE" ]
  then
    # quote variables so paths with spaces and empty log lines are handled
    LOGLINE=`tail -n 1 "$LOGFILE"`
    LOGSTATUS=`tail -n 1 "$LOGFILE" | awk '{print $1}'`
    if [ "$LOGSTATUS" = "FATAL" ]
    then
      echo "$LOGLINE"
      exit 2
    elif [ "$LOGSTATUS" = "WARN" ]
    then
      echo "$LOGLINE"
      exit 1
    fi
  else
    echo "$LOGFILE not found"
    exit 2
  fi
fi

# otherwise all looks to be OK
echo "OK - slon process $SLONPROCESS"
exit 0

check_sloncluster
=================

#!/bin/sh

# nagios plugin that checks whether the slave nodes in a slony cluster
# are being updated from the master
#
# possible exit statuses:
#  0 = OK
#  2 = Error, one or more slave nodes are not sync'ing with the master
#
# script requires two parameters:
# CLUSTERNAME - name of slon cluster to be checked
# DBNAME - name of master database
#
# Author:  John Sidney-Woollett
# Created: 26-Feb-2005
# Copyright 2005

# check parameters are valid
if [ $# -ne 2 ]
then
  echo "Invalid parameters need CLUSTERNAME DBNAME"
  exit 2
fi

# assign parameters
CLUSTERNAME=$1
DBNAME=$2

# setup the query to check the replication status
SQL="select case
  when ttlcount = okcount then 'OK - '||okcount||' nodes in sync'
  else 'ERROR - '||ttlcount-okcount||' of '||ttlcount||' nodes not in sync'
end as syncstatus
from (
-- determine total active receivers
select (select count(distinct sub_receiver)
    from _$CLUSTERNAME.sl_subscribe
    where sub_active = true) as ttlcount,
(
-- determine active nodes syncing within 10 seconds
 select count(*) from (
  select st_received, st_last_received_ts - st_last_event_ts as cfmdelay
  from _$CLUSTERNAME.sl_status
  where st_received in (
    select distinct sub_receiver
    from _$CLUSTERNAME.sl_subscribe
    where sub_active = true
  )
) as t1
where cfmdelay < interval '10 secs') as okcount
) as t2"

# query the master database

CHECK=`/usr/local/pgsql/bin/psql -c "$SQL" --tuples-only -U postgres $DBNAME`

if [ ! -n "$CHECK" ]
then
  echo "ERROR querying $DBNAME"
  exit 2
fi

# echo the result of the query
echo $CHECK

# and check the return status
STATUS=`echo $CHECK | awk '{print $1}'`
if [ "$STATUS" = "OK" ]
then
  exit 0
else
  exit 2
fi
