Enhancement Request

Rob Brucks <rob.brucks@xxxxxxxxxxxxx> · Tue, 19 Apr 2016 20:37:02 +0000

I'd like to propose two enhancements to the PostgreSQL code, but I'm not sure if this is the correct mailing list.  So if it's not then please let me know where I need to post this.

These are monitoring-centric enhancement requests since I'm trying to implement accurate monitoring in a secure fashion. 

* General monitoring:
We have a need for a "monitoring" role in PostgreSQL that has read-only access to any "pg_stat" view.  As of 9.4, only a super-user can read all columns of "pg_stat_activity", "pg_stat_replication", and "pg_stat_archiver" (there may be other restricted
 views as well).  These views provide critical insight on how well the cluster is operating and what is going on.

There appears to be only two ways to gain access to these views: 

grant super-user to my monitoring user
write custom functions owned by a super-user with "SECURITY DEFINER" and grant access to my monitoring user
Option 1 is too big of a security risk to grant to a monitoring user.  If the monitoring system gets compromised then the DB will be at risk too.

Option 2 requires creating, deploying, and maintaining sets of monitoring functions, which is a decent chunk of work in a large environment with dozens or hundreds of deployments, many running different versions of postgres possibly needing custom versions
 of the functions.  When you add the bureaucracy of a large IT organization and SOX and PCI compliance requirements it ends up being a PITA implementing or changing these functions when you only have a small DBA team.

* Streaming Replication Monitoring:
Make the "pg_stat_replication" view more persistent (maybe keep the rows for 24 hours or have a registration process?).

There appears to be no way to *accurately* monitor streaming replication via SQL alone currently.  This is due to three different problems:

When a standby gets disconnected from the master then the corresponding row is immediately removed from pg_stat_replication on the master.  Once the row is gone you cannot tell via simple SQL whether a standby is behind or not (or that it even existed at
 all) without storing prior values and extrapolating from them.
On the standby, if the WAL streaming process gets disconnected from the master then it does not report that it is behind because pg_last_xlog_receive_location() has not been updated from the master.  The standby has no idea how far ahead the master has
 gotten and just blindly reports the last value.
On a "quiet" system there may not be any update activity on the master for a long time, which makes the pg_last_xact_replay_timestamp() function report an ever-increasing interval.  So it is not useable for accurately measuring lag on quiet systems.
Ideally the master should be able to report standby lag time via SQL, even when there has been a disruption in connectivity with a standby.

The only accurate method I have found to measure standby lag is to create a synthetic update that runs periodically.  This works, but is less than ideal and requires adding a table to every cluster (which then has to be vacuumed frequently too) and writing
 and maintaining a process to update it and purge it.

These two enhancements would go a long way in making it easier to monitor PostgreSQL clusters and replication because it would eliminate a lot of custom coding requirements and enable us to pull metrics directly via simple SQL.

If anyone can provide insight on how I could accomplish these in a simple manner by other means then I'm all ears!

Thanks,
Rob