Re: WAL files keep piling up

Zbigniew Kostrzewa <zkostrzewa@xxxxxxxxx> · Wed, 22 Dec 2021 19:11:22 +0100

Thanks for responding. On current master it looks like so:
ls pg_xlog/archive_status/ | grep ready | wc -l
0

ls pg_xlog/archive_status/ | grep done | wc -l
501

Kind regards.

śr., 22 gru 2021 o 19:01 Ninad Shah <nshah.postgres@xxxxxxxxx> napisał(a):
A stupid question. How many .ready files are there?

Regards,
Ninad Shah

On Wed, 22 Dec 2021 at 21:04, Zbigniew Kostrzewa <zkostrzewa@xxxxxxxxx> wrote:
Hi,
I have a PostgreSQL (9.6.15) two node cluster setup with Patroni. The instances are configured to work in master <-> synchronous standby setup and both run in docker containers with pgdata volume mounted from host. When master is restarted the synchronous standby automatically takes the role of master and master starts operating as synchronous replica.

Everything works fine but whenever I restart master instance it creates a new wal file in pg_xlog/ however old wal files are not cleaned up. They pile up and in my reproduction environment, where there are no operations on the database, they currently occupy:
du -sh pg_xlog/
17G     pg_xlog/

and the number of the files is more or less:
ls pg_xlog/ | grep -v history | wc -l
1024

I've searched through the mailing lists and articles like https://dataegret.com/2018/04/pg_wal-is-too-big-whats-going-on/ and the main problems mentioned in most of the places are:
1. failing archive command
2. abandoned (inactive) replication slot
3. issues with checkpoints
3. too big wal_keep_segments value

However, none of those problems seem to apply to my deployment:
1. I have `archive_mode` set to `off`
1a. I even tried enabling it and setting the `archive_command` to '/bin/true' just to confirm a suggestion found in one of the post on the mailing list (that didn't improve anything)
2. the only replication slot in `pg_replication_slots` is the one related to the synchronous replica and it is active
3. I've enabled `log_checkpoint` but doesn't see any errors or warnings related to checkpoints triggered either automatically or manually via `CHECKPOINT;`
4. `wal_keep_segments` is set to 8 and `max_wal_size` is set to 1GB

Is there anything that I should check that could shed some light on this?

A few configuration options taken from `pg_settings`:
archive_command = (disabled)
archive_mode = off
archive_timeout = 0
check_function_bodies = on
checkpoint_completion_target = 0.5
checkpoint_flush_after = 32
checkpoint_timeout = 300
hot_standby = on
hot_standby_feedback = off
max_replication_slots = 10
max_wal_senders = 10
max_wal_size = 64
min_wal_size = 5
synchronous_commit = on
synchronous_standby_names = patroni1
wal_block_size = 8192
wal_buffers = 512
wal_compression = off
wal_keep_segments = 8
wal_level = replica
wal_log_hints = on
wal_receiver_status_interval = 10
wal_receiver_timeout = 60000
wal_retrieve_retry_interval = 5000
wal_segment_size = 2048
wal_sender_timeout = 60000
wal_sync_method = fdatasync
wal_writer_delay = 200
wal_writer_flush_after = 128

Kind regards