Hello,
Over the weekend we noticed that our max transaction IDs have been continuously increasing - got an alert it passed 1B - and we believe that no autovacuums were running for a period of about a month by looking at pg_stat_user_tables. We had not updated any autovac tuning parameters over that time period and many tables were very much over the threshold for needing an autovac.
When investigating I located the table with the oldest transaction ID with:
SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age,
pg_size_pretty(pg_table_size(c.oid)) as table_size,
c.relkind,
c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind in ('r', 't','m')
order by age desc limit 40;
I vacuumed that table manually with `vacuum freeze verbose table_xx` and got this error:
INFO: aggressively vacuuming "public.table_xx"
INFO: scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO: scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR: failed to re-find parent key in index "table_xx_col_idx" for deletion target page 217
I replaced the index (create a new concurrently, delete the old concurrently), vacuumed the table, and immediately autovacs started across the system and our XIDs started falling. To me it looks like a *single* corrupt index held up autovacuums across our entire server, even other in other databases on the same server. Am I interpreting this correctly? Would love guidance on diagnosing this type of thing and strategies for preventing it.
Thanks,
Aaron
Over the weekend we noticed that our max transaction IDs have been continuously increasing - got an alert it passed 1B - and we believe that no autovacuums were running for a period of about a month by looking at pg_stat_user_tables. We had not updated any autovac tuning parameters over that time period and many tables were very much over the threshold for needing an autovac.
When investigating I located the table with the oldest transaction ID with:
SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age,
pg_size_pretty(pg_table_size(c.oid)) as table_size,
c.relkind,
c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind in ('r', 't','m')
order by age desc limit 40;
I vacuumed that table manually with `vacuum freeze verbose table_xx` and got this error:
INFO: aggressively vacuuming "public.table_xx"
INFO: scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO: scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR: failed to re-find parent key in index "table_xx_col_idx" for deletion target page 217
I replaced the index (create a new concurrently, delete the old concurrently), vacuumed the table, and immediately autovacs started across the system and our XIDs started falling. To me it looks like a *single* corrupt index held up autovacuums across our entire server, even other in other databases on the same server. Am I interpreting this correctly? Would love guidance on diagnosing this type of thing and strategies for preventing it.
Thanks,
Aaron