Re: Long running backup preventing auto vacuum

Nikhil Shetty <nikhil.dba04@xxxxxxxxx> · Fri, 1 Jul 2022 14:01:56 +0530

Hi Andrey,
One more thing to add is that, vacuum will not run for catalog tables because of long running pg_stop_backup query

Query:

datname        |  pid  | query_start                   | state  |     runtime     |                            query      
----------------------+-------+---------+------------------+-------------+-----------------+-------------------------------+-----------------------------
 postgres      | 15088 | 2022-07-01 05:56:22.729101+00 | active | 02:12:56.255367 | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)

Example Vacuum:

<dbname>=# vacuum verbose analyze pg_database;
INFO:  vacuuming "pg_catalog.pg_database"
INFO:  "pg_database": found 0 removable, 5 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 996487475
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "pg_catalog.pg_database"
INFO:  "pg_database": scanned 1 of 1 pages, containing 5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows
VACUUM

Autovacuum from logs of previous issue (timestamp does not match because this is taken from last week logs):

022-06-26 00:00:21 UTC [32632]: [2-1] user=,db=,app=,client= HINT:  Close open transactions soon to avoid wraparound problems.
	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
2022-06-26 00:00:21 UTC [32632]: [3-1] user=,db=,app=,client= LOG:  automatic aggressive vacuum of table "postgres.pg_catalog.pg_statistic": index scans: 0
	pages: 0 removed, 40 remain, 0 skipped due to pins, 0 skipped frozen
	tuples: 0 removed, 1064 remain, 577 are dead but not yet removable, oldest xmin: 4200201112
	buffer usage: 57 hits, 43 misses, 0 dirtied
	avg read rate: 203.846 MB/s, avg write rate: 0.000 MB/s
	system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-06-26 00:00:21 UTC [32632]: [4-1] user=,db=,app=,client= WARNING:  oldest xmin is far in the past

2022-06-26 00:00:21 UTC [32632]: [5-1] user=,db=,app=,client= HINT:  Close open transactions soon to avoid wraparound problems.

Both xmin point to the long running pg_stop_backup. Not quite sure, but would this cause autovacuum to not kick-in on application tables?

Thanks,
Nikhil

On Fri, Jul 1, 2022 at 1:11 PM Nikhil Shetty <nikhil.dba04@xxxxxxxxx> wrote:
Hi Andrey,
./wal-g7 --version
wal-g version v1.1.2-rc	6af461f	2021.12.14_08:19:15	PostgreSQL

The only probable problem I can imagine is that we open transaction during pg_stop_backup() [0] to set a statement_timeout
Do you think this will prevent the autovacuum from kicking in ? 

I think this tx can be safely removed
How can I remove this? I can test whether this is the problem 

Thanks,
Nikhil

On Thu, Jun 30, 2022 at 5:18 PM Andrey Borodin <x4mmm@xxxxxxxxxxxxxx> wrote:
Hi!

> On 30 Jun 2022, at 14:23, Nikhil Shetty <nikhil.dba04@xxxxxxxxx> wrote:

> 

> Hi,

> 

> PostgreSQL version - 11.7

> 

> We are using wal-g for backup/restore. We had a duration of high WAL generation due to application usage due to which archive backup was delayed. 

> 

> Backup was started through cron schedule and it didn't complete because it waits for the last WAL to be archived. Now, I know that long running transactions will prevent vacuuming of dead tuples but in this case we were able to run manual vacuums successfully but auto-vacuum never starts.

> 

> As soon as we terminated the long running backup session, the auto-vacuum started kicking in.

> 

> Wanted to understand why a backup operation is blocking the auto-vacuum?

> 

> Wal-g uses non-exclusive backups and waits for the last wal to be archived.

As far as I know, WAL-G should not prevent running autovacuums on tables. WAL-G can be run against replication standby, where vacuum is not possible at all, so there must be no conflicts...

The only probable problem I can imagine is that we open transaction during pg_stop_backup() [0] to set a statement_timeout. I think this is superflous and unneeded. I think this tx can be safely removed. But anyway, this tx does not have a xid (otherwise it could not run on Standby).

BTW which version of WAL-G do you use?

Best regards, Andrey Borodin.

[0] https://github.com/wal-g/wal-g/blob/8b9b3ca06efb44be4724208dc6f6d11836bd34d9/internal/databases/postgres/queryRunner.go#L243