Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

Rui DeSousa <rui@xxxxxxxxxxxxx> · Mon, 5 Nov 2018 11:48:03 -0500

> On Nov 5, 2018, at 6:24 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> 
> Our current settings are :
> 
> wal_keep_segments = 512
> max_wal_size = 2GB
> min_wal_size = 1GB
> 
> Our setup is as follows :

These settings seem counterintuitive. If you’re using the standard 16MB WAL files, then wal_keep_segments = 512 retains 8GB of WAL, but max_wal_size is only 2GB. That seems counterproductive to me and would cause more checkpoints than needed.
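
To spell out the arithmetic, here is a quick sketch in Python. It assumes the default 16MB WAL segment size, which I have not confirmed for your build; the variable names are just for illustration.

```python
# Rough arithmetic for the quoted settings.
# Assumption: default 16MB WAL segment size.
WAL_SEGMENT_MB = 16
wal_keep_segments = 512
max_wal_size_mb = 2 * 1024  # max_wal_size = 2GB

# WAL volume that wal_keep_segments asks the server to retain.
kept_mb = wal_keep_segments * WAL_SEGMENT_MB

print(f"wal_keep_segments retains about {kept_mb} MB ({kept_mb // 1024} GB)")
print(f"max_wal_size is {max_wal_size_mb} MB ({max_wal_size_mb // 1024} GB)")
print(f"retained WAL is {kept_mb // max_wal_size_mb}x max_wal_size")
```

So the retention floor alone is about four times the max_wal_size target.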

How often are your checkpoints occurring, and what is triggering them: time or log (WAL volume)? What’s your checkpoint_timeout set to?

> primary (smadb) <--> (no replication slot) physical hot stanbdby (smadb2) (managed via repmgr) <--> (replication slot) barman
>                 ^--> (replication slot) logical subscriber (testsmadb)
>                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )

Did you check the status of both replication slots, and the archiving status?
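
Inside the database, the pg_replication_slots and pg_stat_archiver views are the places to look. As a filesystem-level complement, here is a minimal sketch that lists WAL files still flagged as waiting for the archiver, based on the .ready marker files in pg_wal/archive_status. The function name and the path in the usage line are hypothetical.

```python
import os

def pending_archive_segments(archive_status_dir):
    """Return names of WAL files that still have a .ready marker,
    i.e. files the archiver has not yet reported as archived."""
    suffix = ".ready"
    pending = []
    for entry in sorted(os.listdir(archive_status_dir)):
        if entry.endswith(suffix):
            # Strip the marker suffix to get the WAL file name itself.
            pending.append(entry[:-len(suffix)])
    return pending
```

Usage would be something like pending_archive_segments("/path/to/data/pg_wal/archive_status"). A list that keeps growing suggests the archive command is failing or falling behind.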

> No ERRORs indication anything with the archive command in the logs,

Postgres is not going to log an error if the archive command fails; I believe it is up to your archive command to log the error.

I suspect it might have been your archive command. Could you verify that you have all the WAL files? I’ve seen a case in a 9.2 environment where startup removed files that had not yet been archived, thus losing WAL files and breaking the backup.

It would be great if you could double-check that you have all the WAL files (no gaps) and report back.
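
One way to check for gaps is to decode the segment numbers from the file names in the archive directory and look for jumps. This is only a sketch. It assumes 16MB segments (256 segments per log file number, as in 9.3 and later) and plain 24-character WAL file names with no compression suffix. The function name is mine.

```python
import re

WAL_NAME = re.compile(r"^[0-9A-F]{24}$")
SEGMENTS_PER_LOG = 256  # assumption: 16MB segments, PostgreSQL 9.3 or later

def find_wal_gaps(filenames):
    """Return (before, after) pairs of WAL file names between which
    one or more segments are missing, checked per timeline."""
    by_timeline = {}
    for name in filenames:
        if not WAL_NAME.match(name):
            continue  # skip .history, .backup, .partial and other files
        timeline = name[0:8]
        # Name layout: 8 hex digits timeline, 8 log number, 8 segment.
        seqno = int(name[8:16], 16) * SEGMENTS_PER_LOG + int(name[16:24], 16)
        by_timeline.setdefault(timeline, {})[seqno] = name
    gaps = []
    for timeline in sorted(by_timeline):
        segments = by_timeline[timeline]
        ordered = sorted(segments)
        for prev, nxt in zip(ordered, ordered[1:]):
            if nxt - prev > 1:
                gaps.append((segments[prev], segments[nxt]))
    return gaps
```

Feeding it os.listdir() of the archive directory (on your setup, the rsync target on sma) should return an empty list if nothing is missing between the oldest and newest file on each timeline.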