VACUUM: Nonremovable rows due to wal sender process

Steve Nixon <inverasln@xxxxxxxxx> · Tue, 4 Jan 2022 12:16:00 -0500

I have PostgreSQL Version 10.7 on AIX 7.1 set up with streaming replication. Replication appears to be working fine and database contents are staying current.
ps -ef |grep sender
postgres 54854022 30212254   0 10:10:29      -  0:00 postgres: wal sender process postgres 10.253.15.123(47852) streaming 54/BB631A30

ps -ef |grep receiver
postgres 34079622  9897420   0 10:10:29      -  0:00 postgres: wal receiver process   streaming 54/BB631A30

The problem I have is related to the wal sender process. The AUTOVACUUM and VACUUM are not cleaning up dead tuples in the tables because it is reporting that they are "nonremovable" due to the backend_xmin that is not changing. This has resulted in queries on some tables taking seconds or minutes to return under 100 tuples that should take 5ms or less.

VACUUM VERBOSE scttlk_tbl;

INFO:  "scttlk_tbl": found 0 removable, 149715 nonremovable row versions in 3322
out of 12152 pages
DETAIL:  149699 dead row versions cannot be removed yet, oldest xmin: 340818216
There were 21246 unused item pointers.
Skipped 0 pages due to buffer pins, 8830 frozen pages.

When I check the backend_xmin that is indicated as preventing the dead tuples from being removed, the PID it points to is the wal sender.

SELECT pid, datname, usename, state, backend_xid, backend_xmin
FROM pg_stat_activity WHERE backend_xmin = 340818216;

   pid    |   datname    | usename  | state  | backend_xid | backend_xmin
----------+--------------+----------+--------+-------------+--------------
54854022 |          | postgres    | active |             |    340818216

I have determined that if I shut down the replication database, the wal sender process will shut down. When I do this and run my VACUUM, it is then able to remove the dead tuples that were nonremovable prior. However, when I restart the replication database, the wal sender becomes active again and tries to pick up where it left off, at the same backend_xmin. 

I believe the issue may be related to another product we are using as part of the replication process called "Attunity". But we have shut that down and restarted it to make sure it did not have any long running queries or other hooks that may be affecting the wal sender and preventing the backend_xmin from moving forward. It just does not seem to do so.

My questions are as follows:

1) Is there anything I can do short of shutting down and restarting the primary (production system) that would allow the backend_xmin to move forward?

2)  Is it possible to "kill" the WAL sender process? I know it's possible, but what I mean is will it crash Postgres doing that? Or will it simply respawn?

Ultimately, the goal is to get backend_xmin to be caught up to work being done today and not waiting on something from days or weeks ago to release so the autovacuum can take place.

Hope I'm explaining myself right! Please let me know any advice you may have on this, and thanks in advance for any tips on where to look or how to address this.

Regards,

Steve N.