You might be onto something. I force quit the pglogical supervisor and the other session that had a backend_start of a few months ago. That didn't fix the issue. Additionally, those processes didn't start back up, which makes me worry about the status of my logical replication. The subscribers report that they are still replicating, but I can't be sure.
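For what it's worth, this is roughly what I'm running to check on both sides (assuming pglogical 2.x, where pglogical.show_subscription_status() is available; nothing below is custom to our setup):

-- on the provider: are the subscribers' logical walsenders connected?
SELECT pid, application_name, state, backend_start
FROM pg_stat_replication;

-- on each subscriber: what does pglogical itself think the subscription is doing?
SELECT * FROM pglogical.show_subscription_status();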
However... when I was checking on the subscribers, I dropped their replication links and recreated them (to see if that would get the pglogical processes to start back up). I checked the pg_log on the master (the same server having the vacuum problem) to see if there was anything about the subscriptions resuming, and I found these messages in the log:

2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d WARNING: oldest xmin is far in the past
2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517 sessionid=59f92ded.6f1d HINT: Close open transactions soon to avoid wraparound problems.
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b WARNING: oldest xmin is far in the past
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521 sessionid=59f92e8d.711b HINT: Close open transactions soon to avoid wraparound problems
It may also be worth noting that the last time our db was restarted (~6 months ago, the same date as the backend_start of the processes I killed) was because the datafile partition filled up.
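To try to narrow down what is still holding the xmin horizon back, this is what I'm checking (all stock catalog views; pg_stat_activity.backend_xmin assumes 9.4 or later):

-- backends holding an old xmin
SELECT pid, datname, usename, state, backend_start, xact_start, backend_xmin
FROM pg_stat_activity
ORDER BY age(backend_xmin) DESC NULLS LAST;

-- replication slots that may be pinning xmin / catalog_xmin
SELECT slot_name, slot_type, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- prepared transactions (already empty here, but checking again for completeness)
SELECT * FROM pg_prepared_xacts;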
For reference, here are the vacuum/autovacuum settings on this server:

name                                  setting
autovacuum                            on
autovacuum_analyze_scale_factor       0.1
autovacuum_analyze_threshold          50
autovacuum_freeze_max_age             200000000
autovacuum_max_workers                3
autovacuum_multixact_freeze_max_age   400000000
autovacuum_naptime                    60
autovacuum_vacuum_cost_delay          20
autovacuum_vacuum_cost_limit          -1
autovacuum_vacuum_scale_factor        0.2
autovacuum_vacuum_threshold           50
autovacuum_work_mem                   -1
log_autovacuum_min_duration           -1
vacuum_cost_delay                     0
vacuum_cost_limit                     200
vacuum_cost_page_dirty                20
vacuum_cost_page_hit                  1
vacuum_cost_page_miss                 10
vacuum_defer_cleanup_age              0
vacuum_freeze_min_age                 50000000
vacuum_freeze_table_age               150000000
vacuum_multixact_freeze_min_age       5000000
vacuum_multixact_freeze_table_age     150000000
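Given the wraparound warnings, a quick way to see how close we actually are (standard catalogs only, nothing pglogical-specific) is:

-- oldest unfrozen XID age per database
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- the ten oldest tables in the current database
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;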
On Tue, Oct 31, 2017 at 6:12 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Arjun Ranade <ranade@xxxxxxxxxxxxxxxxx> writes:
> There is nothing in pg_prepared_xacts; however, in pg_stat_activity there
> are two pglogical processes that have a "backend_start" of "2017-06-17",
> which is when we last restarted this server. Both are not waiting and have
> null for "state." This might be expected behavior for pglogical.
I don't know much about pglogical, but if it's holding back the xmin
horizon (as it appears to be doing) that is a really bad pglogical bug.
If you try vacuum verbose (doesn't have to be FULL) on some table that
gets lots of update traffic, does it report a large number of dead-but-
not-removable rows?
> I did try TRUNCATE before, even as the postgres user, but for some
> reason it didn't allow me to truncate because it was a system table.
Mmm. You could get around that, but possibly not without restarting
in single-user mode, which is probably more interruption in service
than you want. It might be easier to force-quit the pglogical sessions.
regards, tom lane