Hi,

Starting this morning at 0830 local time I noticed that my `datfrozenxid` started moving past the `autovacuum_freeze_max_age` value of 200000000. When we've encountered this in the past, the solution has been one of the following:

1. It was related to an error similar to

   ```
   found xmin 2675436435 from before relfrozenxid 321165377
   ```

   where the solution was to move the `pg_internal.init` file out of the way and let PostgreSQL recreate it.

2. A long-running transaction. Typically I'll just find the `idle in transaction` sessions whose `query_start` is around the time my alarm went off telling me that `datfrozenxid` had breached `autovacuum_freeze_max_age`, using a query similar to:

   ```
   SELECT pid, query_start, datname, usename, state, backend_xmin, age(backend_xmin)
   FROM pg_stat_activity
   WHERE state = 'idle in transaction';
   ```

3. The autovacuum process seemed to be "stuck" on a particular table, in which case we would kill the pid of that autovacuum worker (see the P.S. at the end for the query I use to find those pids).

The problem is that this time none of these solutions seems to have dropped `datfrozenxid` back down, and there is one specific database in this cluster that's holding onto it.

Using these queries from CrunchyData:

```
-- Show oldest current xid
WITH max_age AS (
    SELECT 2000000000 AS max_old_xid
         , setting AS autovacuum_freeze_max_age
    FROM pg_catalog.pg_settings
    WHERE name = 'autovacuum_freeze_max_age'
)
, per_database_stats AS (
    SELECT datname
         , m.max_old_xid::int
         , m.autovacuum_freeze_max_age::int
         , age(d.datfrozenxid) AS oldest_current_xid
    FROM pg_catalog.pg_database d
    JOIN max_age m ON (true)
    WHERE d.datallowconn
)
SELECT max(oldest_current_xid) AS oldest_current_xid
     , max(ROUND(100 * (oldest_current_xid / max_old_xid::float))) AS percent_towards_wraparound
     , max(ROUND(100 * (oldest_current_xid / autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
FROM per_database_stats;

 oldest_current_xid | percent_towards_wraparound | percent_towards_emergency_autovac
--------------------+----------------------------+-----------------------------------
          230935699 |                         12 |                               115
```

```
-- Top 8 individual databases
SELECT datname
     , age(datfrozenxid)
     , current_setting('autovacuum_freeze_max_age')
FROM pg_database
ORDER BY 2 DESC
LIMIT 8;

       datname       |    age    | current_setting
---------------------+-----------+-----------------
 siteservice         | 230935699 | 200000000
 coupon-ws            | 217613246 | 200000000
 contextchangestore  | 211343280 | 200000000
 template0           | 210351639 | 200000000
 productmanager      | 207876167 | 200000000
 rhistory            | 207876167 | 200000000
 smsservice          | 207876167 | 200000000
 techservice         | 205432524 | 200000000
```

That top database, `siteservice`, is the "problematic" one, where a manual vacuum won't coerce it into freeing up the `datfrozenxid`.
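For clarity, by "a manual vacuum" I mean passes along these lines, run while connected to the `siteservice` database (a rough sketch of what I've been doing rather than the exact invocations):

```
-- plain database-wide pass
VACUUM (VERBOSE, ANALYZE);

-- aggressive pass, which I'd expect to be able to advance relfrozenxid
VACUUM (FREEZE, VERBOSE);
```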
Looking at the tables in that database:

```
SELECT c.oid::regclass
     , age(c.relfrozenxid)
     , pg_size_pretty(pg_total_relation_size(c.oid))
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE relkind IN ('r', 't', 'm')
  AND n.nspname NOT IN ('pg_toast')
ORDER BY 2 DESC
LIMIT 10;

        oid         |    age    | pg_size_pretty
--------------------+-----------+----------------
 pg_database        | 230935699 | 4264 kB
 pg_proc            |  93543215 | 976 kB
 pg_collation       |  93543215 | 560 kB
 pg_attribute       |  93543215 | 600 kB
 pg_shdepend        |  59515320 | 15 MB
 pg_statistic       |  53828900 | 464 kB
 pg_subscription    |  53172718 | 16 kB
 pg_pltemplate      |  53172718 | 56 kB
 pg_authid           |  53172718 | 8616 kB
 pg_db_role_setting |  53172718 | 64 kB
```

I thought maybe it had to do with my replication slots somehow:

```
SELECT slot_name, slot_type, database, active, catalog_xmin, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

        slot_name        | slot_type |        database        | active | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
-------------------------+-----------+------------------------+--------+--------------+---------------+---------------------
 dbs1db02                | physical  |                        | f      |              |               |
 dbs1db01                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs1db03                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs2db01                | physical  |                        | t      |              | 4D25/ACE6EE08 |
 debezium_cmanager       | logical   | campaign-manager       | t      |   2152258063 | 4D25/A421A6C8 | 4D25/ABC18C88
 debezium_rservice       | logical   | retail-content-service | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A6105DF8
 debezium_partnerservice | logical   | partnerservice         | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A5446630
 dbs1_dr                 | physical  |                        | f      |              |               |
```

So I either restarted the physical standbys or restarted the logical connections (Debezium -> Kafka).

I'm also working on a:

```
vacuumdb --all --jobs=10 --echo --analyze
```

but this is still running (and honestly I'm not really hopeful it's going to do the trick).

My primary, read/write database is PostgreSQL 10.4 (CentOS 7), while my standby databases have been patched to 10.17.

Any thoughts on this?

Thanks,
Matt
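P.S. Regarding killing stuck autovacuum workers (item 3 above): the query I use to find those pids is roughly the sketch below; on 10 the `backend_type` column in `pg_stat_activity` should make the workers easy to pick out.

```
-- list autovacuum workers and what they are currently working on
SELECT pid, datname, query_start, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
ORDER BY query_start;
```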