Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Tue, 6 Nov 2018 10:37:37 +0200

    On 5/11/18 6:48 PM, Rui DeSousa wrote:

        On Nov 5, 2018, at 6:24 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:

Our current settings are :

wal_keep_segments = 512
max_wal_size = 2GB
min_wal_size = 1GB

Our setup is as follows :

      The settings seem counterintuitive; 

    This email was not sent in order to discuss WAL GUC settings but
    rather the actual problem at hand.

    We have been using PostgreSQL since 2001 and this is the first time
    we have faced such a (rather serious) issue.

      if you’re using standard 16MB WAL files then the keep parameter amounts to 8GB but max_wal_size is at 2GB; that seems counterproductive to me and would cause more checkpoints than needed.
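
For reference, the arithmetic behind the 8GB figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; it assumes the default 16MB segment size mentioned above, and the variable names are illustrative.

```python
# Sizes implied by the settings quoted in this thread.
# Assumption: default 16MB WAL segment size.
WAL_SEGMENT_MB = 16
wal_keep_segments = 512
max_wal_size_mb = 2 * 1024  # max_wal_size = 2GB

# Minimum amount of WAL that wal_keep_segments asks the server to retain.
keep_mb = wal_keep_segments * WAL_SEGMENT_MB

print(f"wal_keep_segments retains at least {keep_mb} MB ({keep_mb // 1024} GB)")
print(f"max_wal_size is {max_wal_size_mb} MB ({max_wal_size_mb // 1024} GB)")
print(f"retention floor is {keep_mb // max_wal_size_mb}x max_wal_size")
```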

    wal_keep_segments covers the case where one has to provide some
    safety (by keeping at least this number of WAL files) for replication
    clients when replication slots are not in use.

    checkpoint_timeout / max_wal_size control checkpoints. All of them +
    other conditions are used in the algorithm which decides how many
    files to keep in pg_wal. That's what I am trying to figure out here.
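
As a rough mental model only (not the server's actual code), the conditions named above can be combined into a single "oldest segment that must be kept" value. The function and parameter names below are hypothetical, segment numbers are treated as plain integers, and the model deliberately ignores two things: segments not yet archived are also held back, and min_wal_size / max_wal_size influence whether old segments are recycled or removed.

```python
def oldest_segment_to_keep(redo_segno, current_segno,
                           wal_keep_segments, slot_restart_segnos):
    """Simplified model of the WAL retention floor.

    redo_segno          -- segment holding the redo point still needed for recovery
    current_segno       -- segment currently being written
    wal_keep_segments   -- value of the GUC of the same name
    slot_restart_segnos -- segments holding each replication slot's restart_lsn
    """
    keep = redo_segno
    if wal_keep_segments > 0:
        # Never go below segment 1 when fewer segments exist than requested.
        keep = min(keep, max(current_segno - wal_keep_segments, 1))
    for segno in slot_restart_segnos:
        # A lagging slot pulls the floor back as far as it needs.
        keep = min(keep, segno)
    return keep


# Example: no slots lagging, so wal_keep_segments sets the floor.
print(oldest_segment_to_keep(1000, 1010, 512, []))
# Example: one slot stuck at segment 100 holds everything back.
print(oldest_segment_to_keep(1000, 1010, 512, [100, 1005]))
```

Under this model, anything older than the returned segment is a candidate for removal at the next checkpoint, which is why a stuck slot or a failing archiver is the usual first suspect when pg_wal grows.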

How often are your checkpoints occurring and why, time or log? What’s your checkpoint_timeout set to? 

        primary (smadb) <--> (no replication slot) physical hot standby (smadb2) (managed via repmgr) <--> (replication slot) barman
                ^--> (replication slot) logical subscriber (testsmadb)
                ^--> wal archiving to host (sma) (via /usr/bin/rsync -a --delay-updates %p sma:/smadb/pgsql/pitr/%f )

Did you check the status of both the replication slots and archiving?

        No ERRORs indicating any problem with the archive command in the logs,

Postgres is not going to log an error if the archive command fails; I believe it is up to your archive command to log the error.

    No, PostgreSQL will complain. Normally (in 10) you get something like : LOG:  archive command failed with exit code .....
    In previous versions the log level was even more severe IIRC.

      I would suspect it might have been your archive command.  Could you verify that you have all the WAL files? I’ve seen a case in a 9.2 environment where the startup removed files that were not yet archived thus losing WAL files and breaking the backup.  

It would be great if you could double-check that you have all the WAL files (no gaps) and report back.

    Absolute continuity.
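
One way such a continuity check could be scripted is sketched below. It assumes the default 16MB segment size (so 256 segments per high-order "log" number) and the standard 24-hex-digit WAL file names; the helper names are made up for illustration, and only gaps within a single timeline are reported.

```python
import re

# Assumption: 16MB segments, so each 4GB "log" number holds 256 segments.
SEGMENTS_PER_XLOGID = 256

# Plain WAL segments only; .history, .backup and .partial files are skipped.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name):
    """Convert a WAL file name (timeline + log + seg, 8 hex digits each)
    into a linear segment number, ignoring the timeline."""
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * SEGMENTS_PER_XLOGID + seg


def find_gaps(filenames):
    """Return (before, after) pairs of names between which segments are missing."""
    names = sorted(n for n in filenames if WAL_NAME.match(n))
    gaps = []
    for prev, cur in zip(names, names[1:]):
        same_timeline = prev[:8] == cur[:8]
        if same_timeline and segment_number(cur) - segment_number(prev) > 1:
            gaps.append((prev, cur))
    return gaps


sample = [
    "0000000100000000000000FE",
    "0000000100000000000000FF",
    "000000010000000100000000",
    "000000010000000100000002",  # ...0001 is missing before this one
]
print(find_gaps(sample))
```

In practice the list would come from something like `os.listdir()` on the archive directory (here, the rsync target on host sma); an empty result means no holes were found in the names present.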

    Remember: the PostgreSQL checkpointer decided to remove 5000+ files
    before shutdown. Any conditions that were keeping those files around
    should also have held at that point, right?

    The question is why PostgreSQL didn't remove them earlier.

    -- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt