On 5/11/18 6:48 PM, Rui DeSousa wrote:
> On Nov 5, 2018, at 6:24 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> Our current settings are:
>>
>> wal_keep_segments = 512
>> max_wal_size = 2GB
>> min_wal_size = 1GB
>
> The settings seem counterintuitive; if you're using standard 16MB WAL files then the keep parameter is at 8GB but max_wal_size is at 2GB. That seems counterproductive to me and would cause more checkpoints than needed. How often are your checkpoints occurring and why, time or log? What's your checkpoint_timeout set to?

This email was not sent in order to discuss WAL GUC settings but rather the actual problem at hand. We have been using PostgreSQL since 2001 and this is the first time we have faced such a (rather serious) issue. wal_keep_segments covers the case where one has to provide some safety (by keeping at least this number of WAL files) for replication clients when replication slots are not in use. checkpoint_timeout / max_wal_size control checkpoints. All of them, plus other conditions, are used in the algorithm which decides how many files to keep in pg_wal. That's what I am trying to figure out here.

>> Our setup is as follows:
>>
>> primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr)
>>                 <--> (replication slot) barman
>>                 ^--> (replication slot) logical subscriber (testsmadb)
>>                 ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )
>
> Did you check the status of both the replication slots and archiving status?
>
>> No ERRORs indicating anything with the archive command in the logs,
>
> Postgres is not going to log an error if the archive command fails; I believe it is up to your archive command to log the error.

No, PostgreSQL will complain. Normally (in 10) you get something like:

LOG: archive command failed with exit code .....

In previous versions the log level was even more severe, IIRC.

>> Absolute continuity.
>
> I would suspect it might have been your archive command. Could you verify that you have all the WAL files? I've seen a case in a 9.2 environment where the startup removed files that were not yet archived, thus losing WAL files and breaking the backup. It would be great if you could double check that you have all the WAL files (no gaps) and report back.

Remember: the PostgreSQL checkpointer decided to remove 5000+ files before shutdown. Any conditions that were keeping those files around should also have held at that point, right? The question is why PostgreSQL didn't remove them earlier.

-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
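
For the "all the WAL files (no gaps)" check requested in the thread, a minimal Python sketch follows. It is an illustration, not something posted by either participant. It assumes the standard 24-hex-digit WAL file naming, the default 16MB segment size, and PostgreSQL 9.3 or later (where segment FF of each log is used). The function name find_wal_gaps is hypothetical. Files such as .history, .backup and .partial are ignored.

```python
import re
from collections import defaultdict

# A plain WAL segment name: 8 hex digits timeline, 8 log id, 8 segment number.
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def find_wal_gaps(filenames, segment_size=16 * 1024 * 1024):
    """Return the names of WAL segments missing between the lowest and
    highest segment seen on each timeline (assumption: 9.3+ naming)."""
    # Number of segments per "log id"; 256 for the default 16MB segment size.
    segs_per_log = 0x100000000 // segment_size

    by_timeline = defaultdict(set)
    for name in filenames:
        if not WAL_NAME.fullmatch(name):
            continue  # skip .history, .backup, .partial and unrelated files
        timeline = name[:8]
        log_id = int(name[8:16], 16)
        seg_no = int(name[16:24], 16)
        by_timeline[timeline].add(log_id * segs_per_log + seg_no)

    missing = []
    for timeline in sorted(by_timeline):
        present = by_timeline[timeline]
        for n in range(min(present), max(present) + 1):
            if n not in present:
                log_id, seg_no = divmod(n, segs_per_log)
                missing.append(f"{timeline}{log_id:08X}{seg_no:08X}")
    return missing
```

On the archive host this could be called as find_wal_gaps(os.listdir("/smadb/pgsql/pitr")), using the directory from the rsync archive command quoted above. An empty list means the archived segments are contiguous within each timeline. It says nothing about segments older than the first or newer than the last file present.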