On Fri, Dec 10, 2010 at 11:27 AM, Greg Sabino Mullane <greg@xxxxxxxxxxxx> wrote:
> Correct. But since we cannot connect to a database in recovery mode,
> there are very few options to determine how far 'behind' it is. The
> pg_controldata is what the check_postgres program uses. This offers a
> rough check which is usually sufficient unless you have a very
> inactive database or need very fine grained checking.
>
> A better system would perhaps connect to both ends and examine which
> specific WALs were being shipped and which one was last played, but
> there are no tools I know of that do that. I suspect the reason for
> this is that the pg_controldata check is "good enough". Certainly,
> that's what we are using for many clients via check_postgres, and
> it's been very good at detecting when the replica has problems. Good
> enough that I've never worried about writing a different method,
> anyway. :)

Thanks for the reply. One simple piece I added to my monitoring
script, which isn't in the script here:

http://www.kennygorman.com/wordpress/?p=249

(or in check_postgres.pl, from a quick look at check_checkpoint()),
is a verification that the standby slave is actually 'in archive
recovery' mode, based on the 'Database cluster state:' line in the
output of pg_controldata.

I was mulling over some ways to add a reasonable check that the
standby is keeping up with the WAL stream. Comparing WAL file names
on master vs. standby would probably work, but I was also thinking
that a simple directory-size check on the standby's WAL archive
directory would show whether we are receiving WAL files faster than
we can process them.

Josh

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
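
For anyone reading this thread later, here is a rough Python sketch of the two checks described in the message above: confirming that pg_controldata reports 'in archive recovery', and measuring how much is sitting in the standby's WAL archive directory. It is not taken from the monitoring script or from check_postgres.pl. The function names, the max_backlog_bytes threshold, and the idea that replayed WAL files get removed from the archive directory (so that a growing directory means the standby is falling behind) are all assumptions made for illustration.

```python
import os
import subprocess


def parse_controldata(text):
    """Turn the 'Label:   value' lines printed by pg_controldata into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in values stay intact.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def in_archive_recovery(fields):
    """True if the cluster state line reads 'in archive recovery'."""
    return fields.get("Database cluster state") == "in archive recovery"


def archive_backlog(archive_dir):
    """Return (file_count, total_bytes) for regular files in archive_dir."""
    count = 0
    total = 0
    for entry in os.scandir(archive_dir):
        if entry.is_file():
            count += 1
            total += entry.stat().st_size
    return count, total


def check_standby(data_dir, archive_dir, max_backlog_bytes):
    """Return a list of problem strings; an empty list means both checks passed.

    max_backlog_bytes is a hypothetical threshold; pick a value that fits
    your WAL volume. Assumes replayed WAL files are cleaned out of
    archive_dir, otherwise the size says nothing about replay lag.
    """
    # Force the C locale so the labels are not translated.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["pg_controldata", data_dir],
        capture_output=True, text=True, check=True, env=env,
    )
    problems = []
    if not in_archive_recovery(parse_controldata(result.stdout)):
        problems.append("standby is not in archive recovery")
    count, total = archive_backlog(archive_dir)
    if total > max_backlog_bytes:
        problems.append(
            "WAL archive backlog: %d files, %d bytes" % (count, total)
        )
    return problems
```

A cron job or Nagios wrapper could call check_standby() with the standby's data directory and archive directory and raise an alert whenever the returned list is non-empty. Like the pg_controldata timestamp check, this is only a rough signal: it shows files piling up, not exactly which WAL segment was last replayed.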