Re: WAL segment issues on both master and slave server

Payal Singh <payal@xxxxxxxxxx> · Thu, 19 Oct 2017 13:49:54 -0400

On Thu, Oct 19, 2017 at 1:25 PM, Chris Kim <chrisk@xxxxxxxxxxx> wrote:
Hi there,

I am running into an issue with the number of files in the pg_xlog directory of my compliance database server (this is the master server in our master-slave setup). Sometime earlier this year, I changed the location of the PITR directory, which caused WAL segments not to be sent to the correct location and crashed the DB. I fixed that so it points to the correct location, but since then the number of files in the pg_xlog directory has gone up from around 898 to 1025. I didn't have a chance to look into this until now, so my question is: do you know of an easy way to safely clean up some of these files in the pg_xlog directory? I believe there might be some orphaned files there and would like to clean those up.

How is the replication being done? Is the replica in sync with the master? Check the lag on the replica and the replication byte lag on the master. If they are in sync, an `ls -l | less` in the WAL directory should show you which older files are being kept. Do check the postgres and archive logs on both master and replica for any ERROR or FATAL messages before you remove any files, though. As an extra precaution, you can just move the older files to another location where postgres can't access them, and if something breaks, you can move them back. If all looks good after moving, you can delete the files you moved.
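
To illustrate the byte-lag check, here is a minimal sketch that converts two WAL positions to byte offsets and subtracts them. It assumes you have already fetched the positions as text, for example from pg_current_xlog_location() on the master and the replay_location column of pg_stat_replication (renamed pg_current_wal_lsn() and replay_lsn in PostgreSQL 10). The function names and the sample positions are made up for the example.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a WAL position such as '16/B374D848' to an absolute byte offset.

    The text form is two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of the 64-bit position.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_byte_lag(master_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the master."""
    return lsn_to_bytes(master_lsn) - lsn_to_bytes(replica_lsn)


if __name__ == "__main__":
    # Sample values only; substitute the positions reported by your servers.
    lag = replication_byte_lag("16/B374D848", "16/B3000000")
    print(f"replica is {lag} bytes behind the master")
```

A lag that stays near zero suggests the replica is keeping up; a lag that keeps growing would explain WAL files piling up.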

I would highly recommend having a monitor in place that tracks the number of WAL files in the WAL directory and alerts if it gets too high.
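
A minimal sketch of such a monitor, assuming it runs on the database host with read access to the WAL directory. It counts only files whose names are 24 hexadecimal characters (the WAL segment naming pattern), so the archive_status subdirectory and .history or .backup files are ignored. The function names and the threshold are placeholders of mine, not anything your setup requires.

```python
import os
import re
import sys
from typing import Tuple

# WAL segment file names consist of exactly 24 uppercase hexadecimal characters.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def count_wal_segments(wal_dir: str) -> int:
    """Count the WAL segment files directly inside wal_dir."""
    return sum(1 for name in os.listdir(wal_dir) if WAL_SEGMENT_RE.match(name))


def check_wal_count(wal_dir: str, threshold: int) -> Tuple[bool, int]:
    """Return (too_many, count); too_many is True when count exceeds threshold."""
    count = count_wal_segments(wal_dir)
    return count > threshold, count


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python wal_monitor.py <path to pg_xlog> <threshold>
    too_many, count = check_wal_count(sys.argv[1], int(sys.argv[2]))
    print(f"{count} WAL segments in {sys.argv[1]}")
    # A nonzero exit status lets cron or a monitoring agent raise the alert.
    sys.exit(1 if too_many else 0)
```

Pick the threshold based on what is normal for your system; with your numbers, something a bit above the usual ~900 files would catch the kind of growth you described.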

Also, on the standby, the pg_xlog directory appears to be growing on a daily basis. The WAL files are being cleaned up, but I don't believe at a fast enough rate. This directory is over 650GB in size, and I would like to revisit whether any of the parameters in the postgresql.conf file need to be changed, since it's been almost 5 years since I last touched it.

Let me know if you need more details to clarify.

Thanks.

Again, this might be a sign that replication is lagging. If your cleanup command is correct and related logs have nothing suspicious, checking the replication lag would be a good first step to determine the cause. 

Thanks,
Payal Singh,
Database Administrator, OmniTI Computer Consulting Inc.
Phone: 240.646.0770 x 253